Until recently, the dominant approach for computer vision models was to first pretrain on a huge amount of multimodal data from the internet and then fine-tune the model for the task at hand. Fine-tuning is a resource-intensive process: it requires annotating thousands of data points and retraining the model on them. That changed when multimodal vision-language models began to be trained with a contrastive objective, enabling zero-shot transfer to entirely new tasks and removing the need for fine-tuning.
Source: Research paper, Examples of inputs and outputs from Flamingo
These models, however, only produce a similarity score between a text and an image, so they are useful in a limited set of cases, such as classification, where a fixed set of outcomes is given beforehand. They are not suited to generating language, which rules out tasks like captioning and visual question answering. Visually conditioned language generation models have also been tried, but without much success. Last week, DeepMind introduced a visual language model called Flamingo that performs few-shot learning on a range of open-ended vision and language tasks from just a few prompts. The research team conducted a study testing the model on a range of multimodal tasks.
Building and training
Source: Research paper, Examples of interwoven text and visuals
Flamingo is trained on text interleaved with images and videos, which is what lets it handle such a diverse range of tasks. The model can work both on open-ended tasks such as captioning and visual question answering, which require text generation, and on close-ended tasks such as classification, where the best category or answer must be selected from a given set. More importantly, Flamingo adapts to new tasks through few-shot learning: annotated visual-and-text example pairs are interleaved and fed to the model as a prompt, without any changes to the model's weights.
Flamingo was built from pretrained models to avoid spending computing power on training from scratch. For the vision side, the researchers pretrained a vision encoder with a contrastive text-image objective similar to CLIP. This encoder extracts semantic spatial features describing the attributes that a query about a visual input might refer to: colour, shape, position and nature. For the language side, an existing autoregressive language model was used, trained on a large and rich text corpus.
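The contrastive objective behind such vision encoders can be illustrated with a minimal sketch: matched image-text pairs are pushed towards high similarity and mismatched pairs towards low similarity. This is a generic CLIP-style loss in numpy, not DeepMind's actual implementation; the temperature value and function names are illustrative.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss. Row i of image_emb is assumed
    to be paired with row i of text_emb; all other pairings are negatives."""
    # Normalise embeddings to unit length so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(len(logits))  # correct match is on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimising this loss pulls each image embedding towards its paired caption embedding, which is what makes the similarity score usable for zero-shot classification.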
Source: Research paper, Flamingo model’s architecture
Flamingo owes its strong language-generation ability to the large amount of knowledge stored in the language model's weights. The two pretrained models are interconnected through two learnable architectures, while their own weights are frozen so that their initial capabilities are preserved. First, the vision encoder extracts spatio-temporal features from images or video and passes them to the Perceiver Resampler, which produces a fixed-size set of visual tokens as output. These visual tokens are then used to condition the frozen language model through cross-attention layers interleaved between the pretrained language model's layers. This gives the language model a new way to absorb visual information for the next-token prediction task.
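The key property of the Perceiver Resampler is that a variable number of visual features (however many frames or patches come in) is mapped to a fixed-size set of tokens via cross-attention from a small set of learned latent queries. The sketch below shows only that shape-reducing mechanism, in numpy and without training, multiple heads, or the gating used in the real model; the class name and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class PerceiverResampler:
    """Maps a variable number of visual features to a fixed set of tokens:
    a small set of learned latent queries cross-attends over the features."""
    def __init__(self, dim, num_latents=8, seed=0):
        rng = np.random.default_rng(seed)
        # In the real model these latents are trained; here they are random
        self.latents = rng.normal(size=(num_latents, dim))

    def __call__(self, visual_features):
        # visual_features: (num_patches_or_frames, dim), any length
        scale = np.sqrt(visual_features.shape[1])
        attn = softmax(self.latents @ visual_features.T / scale)
        # Output is always (num_latents, dim), regardless of input length
        return attn @ visual_features
```

Because the output size is constant, the cross-attention layers inserted into the frozen language model always see the same number of visual tokens, whether the input was a single image or a long video.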
To prompt the model on a new task, alternating visual inputs and text responses are provided, followed by one final test video or image. Given this prompt, output text can either be sampled or the probability of each completion in a fixed set can be evaluated. Flamingo's ability to handle interleaved text and visuals makes it a natural fit for in-context few-shot learning, much like GPT-3's few-shot text prompting. DeepMind's recently released large language model, the 70-billion-parameter Chinchilla, was used as the base model for the largest Flamingo model.
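The structure of such a few-shot prompt, support examples interleaved with their answers, then the query with an empty answer slot, can be sketched as below. The image placeholder and separator strings are purely illustrative and are not the special tokens used in the paper.

```python
def build_fewshot_prompt(examples, query_image):
    """Build an interleaved few-shot prompt: each support example pairs a
    visual input with its text answer, and the query image comes last with
    the answer left blank for the model to complete."""
    parts = []
    for image, answer in examples:
        parts.append(f"<image:{image}> Output: {answer}")
    # The final slot has no answer; the model generates (or scores) it
    parts.append(f"<image:{query_image}> Output:")
    return " <EOC> ".join(parts)

prompt = build_fewshot_prompt(
    [("cat.jpg", "A cat on a sofa"), ("dog.jpg", "A dog in a park")],
    "bird.jpg",
)
```

Nothing about the model changes between tasks; only this prompt does, which is what makes the adaptation "in-context" rather than fine-tuning.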
Findings
Three Flamingo models were produced: a 3 billion parameter model built on top of a 1.4 billion parameter frozen language model, a 9 billion parameter model built on a 7 billion parameter frozen language model, and an 80 billion parameter model built on the frozen 70 billion parameter Chinchilla model. The researchers tested Flamingo on 16 tasks, on which it outperformed previous few-shot learning approaches even when given as few as four examples per task.
Source: DeepMind blog, Flamingo engaging in a multimodal conversation and passing the Stroop test
Apart from testing Flamingo on current benchmarks, the study qualitatively examined its performance when captioning images related to gender and skin colour, and ran the generated captions through Google's Perspective API to evaluate their toxicity. The study also presented qualitative examples of interesting interactive abilities, such as "chatting" with the model and asking open-ended questions about input images and videos. Flamingo proved flexible and could potentially serve as a bridge between large language models and visual representations, a step towards general-purpose visual understanding.