Visual language modelling has recently emerged as a feasible option for content-based image classification. In such methods, each image is converted into a matrix of visual words, and each visual word is assumed to be conditionally dependent on its neighbours.
While such cross-modal work poses various challenges, significant progress has been made on vision-language modelling in the past few years, thanks to the adoption of effective vision-language pre-training (VLP). VLP aims to learn a single feature space from both visual and language inputs, rather than two separate feature spaces, one for visual inputs and another for language inputs.
Existing VLP methods frequently use an object detector trained on labelled object-detection datasets to extract regions of interest (ROIs), together with task-specific techniques (i.e., task-specific loss functions), to learn image and text representations simultaneously. However, such approaches are less scalable because they require annotated datasets and additional effort to design task-specific methods.
To address this issue, researchers at Google propose a simple yet effective VLP model called SimVLM, which stands for "Simple Visual Language Model Pre-training with Weak Supervision." SimVLM is trained end-to-end on a large number of weakly aligned image-text pairs with a single unified objective similar to language modelling. SimVLM's simplicity enables efficient training on such a large dataset, allowing the model to achieve state-of-the-art performance across six vision-language benchmarks.
SimVLM uses a sequence-to-sequence framework and is trained with a single prefix language modelling (PrefixLM) objective. PrefixLM receives the leading part of a sequence (the prefix) as input and predicts its continuation. In SimVLM, for multimodal inputs (e.g., images and their captions), the prefix is the concatenation of the image patch sequence and the leading text sequence, and it is received by the encoder; the decoder then predicts the continuation of the textual sequence.
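To make the PrefixLM setup concrete, below is a minimal sketch in PyTorch. It assumes a standard Transformer encoder-decoder, toy dimensions, and hypothetical helpers (patch_proj, prefix_lm_logits); it only illustrates the idea of bidirectional attention over an image-plus-text prefix followed by autoregressive text prediction, and is not the authors' implementation.

```python
# Minimal PrefixLM sketch (illustrative, not the SimVLM authors' code).
# Assumes toy dimensions and 1024-dim image patch features, e.g. taken
# from a convolutional backbone.
import torch
import torch.nn as nn

d_model, vocab_size, n_heads = 512, 32000, 8

embed_tokens = nn.Embedding(vocab_size, d_model)
patch_proj = nn.Linear(1024, d_model)   # project patch features to model width
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
lm_head = nn.Linear(d_model, vocab_size)

def prefix_lm_logits(patch_feats, prefix_ids, target_ids):
    """patch_feats: (B, P, 1024) image patch features,
    prefix_ids:  (B, Tp) leading text tokens (the textual part of the prefix),
    target_ids:  (B, Tt) the text continuation the decoder should predict."""
    # The prefix = image patches followed by the leading text; the encoder
    # attends to it bidirectionally.
    prefix = torch.cat([patch_proj(patch_feats), embed_tokens(prefix_ids)], dim=1)
    memory = encoder(prefix)
    # The decoder predicts the continuation autoregressively (causal mask).
    causal = torch.triu(torch.full((target_ids.size(1),) * 2, float("-inf")), diagonal=1)
    hidden = decoder(embed_tokens(target_ids), memory, tgt_mask=causal)
    return lm_head(hidden)   # (B, Tt, vocab_size) next-token logits

# Toy forward pass with random inputs.
logits = prefix_lm_logits(torch.randn(2, 16, 1024),
                          torch.randint(0, vocab_size, (2, 5)),
                          torch.randint(0, vocab_size, (2, 7)))
```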

Unlike previous ROI-based VLP techniques, this design allows the model to take raw images directly as input. Furthermore, the researchers use a convolution stage consisting of the first three blocks of a ResNet to extract contextualized patches.
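As a hedged illustration of this patch-extraction step, the snippet below uses torchvision's ResNet-50 and treats its first three residual stages (layer1-layer3) as the "first three blocks"; the exact backbone, block choice, and input resolution used in the paper may differ.

```python
# Sketch of a ResNet-based conv stage for contextualized patches
# (interpreting "first three blocks" as layer1-layer3 of torchvision's
# ResNet-50, which is an assumption).
import torch
import torch.nn as nn
from torchvision.models import resnet50

resnet = resnet50(weights=None)   # untrained backbone, for illustration only
conv_stage = nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,
)

images = torch.randn(2, 3, 224, 224)          # toy batch of RGB images
feature_map = conv_stage(images)              # (2, 1024, 14, 14) for ResNet-50
# Flatten the 14x14 grid into a sequence of 196 patch features that can be
# fed to the encoder (e.g. via patch_proj in the sketch above).
patch_feats = feature_map.flatten(2).transpose(1, 2)   # (2, 196, 1024)
```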
As the researchers trained SimVLM on a considerable quantity of visual and textual data, they also investigated whether it can accomplish zero-shot cross-modality transfer on various tasks, including image captioning, multilingual captioning, open-ended VQA, and visual text completion. The pre-trained SimVLM was used to decode multimodal inputs directly, with fine-tuning on text data only or no fine-tuning at all. The results showed that the model can produce high-quality image captions and descriptions, allowing for cross-lingual and cross-modality translation.
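To make this zero-shot decoding concrete, here is a hedged sketch of greedy decoding that reuses the toy prefix_lm_logits and patch_feats from the sketches above; the prompt, BOS/EOS token ids, and caption length are hypothetical, and an untrained toy model will of course not produce meaningful captions.

```python
# Greedy zero-shot decoding sketch reusing the toy prefix_lm_logits above.
# The bos_id/eos_id values and the text prompt are purely illustrative.
@torch.no_grad()
def greedy_caption(patch_feats, prompt_ids, bos_id=1, eos_id=2, max_len=20):
    # Start the continuation with a BOS token and extend it one token at a time.
    cont = torch.full((patch_feats.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = prefix_lm_logits(patch_feats, prompt_ids, cont)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy choice
        cont = torch.cat([cont, next_tok], dim=1)
        if (next_tok == eos_id).all():
            break
    return cont[:, 1:]   # generated caption token ids (BOS dropped)

# prompt_ids would encode a textual prefix such as "A picture of".
caption_ids = greedy_caption(patch_feats, torch.randint(0, vocab_size, (2, 4)))
```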

The Google AI Research team claims that, unlike prior work that used object detection models and task-specific auxiliary losses, the current model is trained end-to-end with a single prefix language modelling objective. The new approach not only achieves state-of-the-art performance but also exhibits intriguing zero-shot behaviour in multimodal understanding tasks.