Google AI Launches SimVLM, A Simple Visual Language Model Pre-training with Weak Supervision

The new approach not only achieves state-of-the-art performance but also exhibits intriguing zero-shot behaviours in multimodal understanding tasks.

Visual language modelling has recently emerged as a feasible approach to content-based image classification. In such methods, each image is converted into a matrix of visual words, and each visual word is assumed to be conditionally dependent on its neighbours.

Although cross-modal work of this kind faces various challenges, vision-language modelling has made significant progress in the past few years, thanks to the adoption of effective vision-language pre-training (VLP). VLP aims to learn a single feature space from both visual and language inputs, rather than two separate feature spaces, one for each modality.

Existing VLP methods frequently use an object detector trained on labelled object detection datasets to extract regions-of-interest (ROI), together with task-specific techniques (i.e., task-specific loss functions), to learn image and text representations simultaneously. However, such approaches are less scalable because they require annotated datasets and time to build task-specific methods.


To address this issue, researchers at Google propose a simple yet effective VLP model called SimVLM, which stands for “Simple Visual Language Model Pre-training with Weak Supervision.” SimVLM is trained end-to-end on a large number of weakly aligned image-text pairs with a single unified objective similar to language modelling. SimVLM’s simplicity allows for efficient training on such a large dataset, allowing the model to achieve state-of-the-art performance across six vision-language benchmarks.

SimVLM uses a sequence-to-sequence framework and is trained with a single prefix language modelling (PrefixLM) objective. PrefixLM receives the leading part of a sequence (the prefix) as input and predicts its continuation. In SimVLM, for multimodal inputs (e.g., images and captions), the prefix is the concatenation of the image patch sequence and the prefix text sequence, and it is received by the encoder. The decoder then predicts how the textual sequence will continue.
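The PrefixLM idea can be illustrated with a small attention-mask sketch. This is not SimVLM's code: it flattens the encoder-decoder setup described above into a single sequence purely for illustration, and the function names (`prefix_lm_mask`, `loss_positions`) and the placeholder tokens are assumptions made for this example. Positions inside the prefix see each other in both directions, while positions in the continuation see the whole prefix plus only the earlier continuation tokens.

```python
from typing import List


def prefix_lm_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """Return mask[i][j], True when position i may attend to position j.

    Hypothetical helper for illustration: prefix positions attend to the
    whole prefix (bidirectional); continuation positions attend to the
    prefix and to continuation positions up to and including themselves.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    return [
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]


def loss_positions(prefix_len: int, total_len: int) -> List[int]:
    """Positions the model is asked to predict: only the continuation."""
    return list(range(prefix_len, total_len))


# Placeholder inputs (assumed for this sketch, not taken from the paper).
image_patches = ["<img0>", "<img1>", "<img2>"]  # stand-ins for patch embeddings
prefix_text = ["a", "dog"]                      # text shown as part of the prefix
target_text = ["on", "the", "grass"]            # continuation to be predicted

prefix = image_patches + prefix_text            # image patches, then prefix text
sequence = prefix + target_text
mask = prefix_lm_mask(len(prefix), len(sequence))
targets = loss_positions(len(prefix), len(sequence))

# Render the mask: 'x' marks an allowed attention edge, '.' a blocked one.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this toy layout the first five rows (the prefix) are identical, since every prefix position sees the full prefix and nothing after it; the last three rows form the usual causal staircase over the continuation.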


Image: Google AI

Unlike previous ROI-based VLP techniques, SimVLM takes raw images directly as inputs. In addition, the researchers use a convolution stage consisting of the first three blocks of ResNet to extract contextualized patches.

Because SimVLM was trained on a considerable quantity of data from both visual and textual modalities, the researchers also investigated whether it can perform zero-shot cross-modality transfer on various tasks. These included image captioning, multilingual captioning, open-ended VQA, and visual text completion. The pre-trained SimVLM was used to decode multimodal inputs directly, with fine-tuning on text data only or with no fine-tuning at all. The results showed that the model can produce high-quality image captions and descriptions, allowing for cross-lingual and cross-modality translation.

Image: Google AI

The Google AI research team notes that, unlike prior work that relied on object detection models and task-specific auxiliary losses, SimVLM is trained end-to-end with a single prefix language model objective. The approach not only achieves state-of-the-art performance but also exhibits intriguing zero-shot behaviours in multimodal understanding tasks.


Victor Dey
Victor is an aspiring Data Scientist & is a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.
