Until recently, the dominant approach for computer vision models was to first pretrain on a huge amount of multimodal data from the internet and then fine-tune the model for the task at hand. Fine-tuning is a resource-intensive process: it requires annotating thousands of data points and retraining the model on them. That changed when multimodal vision-language models began to be trained with a contrastive objective, enabling zero-shot transfer to entirely new tasks and removing the need for fine-tuning.
Source: Research paper, Examples of inputs and outputs from Flamingo
These models, however, only produce a similarity score between a text and an image, so they are useful in a limited set of cases, such as classification, where a fixed set of outcomes is given beforehand. They are not suited to generating language, which rules out tasks like captioning and visual question answering. Visually conditioned language generation models have also been tried, but without much success. Last week, DeepMind introduced a visual language model called Flamingo that performs few-shot learning on a range of open-ended vision and language tasks from just a few prompts. The research team conducted a study testing the model on a range of multimodal tasks.
Building and training
Source: Research paper, Examples of interwoven text and visuals
Flamingo is trained on text interleaved with images and videos, which is what lets it handle such a diverse range of tasks. The model can work both on open-ended tasks such as captioning and visual question answering, which require text generation, and on close-ended tasks such as classification, where the best category or answer must be selected from a given set. More importantly, Flamingo adapts to new tasks through few-shot learning: annotated visual-and-text example pairs are interleaved and fed to the model as a prompt, without any changes to the model's weights.
Flamingo was built from pretrained models to avoid spending computing power on training from scratch. For the vision side, the researchers pretrained a vision encoder with a contrastive text-image objective similar to CLIP. This encoder extracts semantic spatial features describing the attributes that a query about a visual input might refer to: colour, shape, position and nature. For the language side, an existing autoregressive language model was used, trained on a large and rich text corpus.
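The contrastive objective behind such vision encoders can be illustrated with a minimal sketch: matched image-text pairs are pushed towards high similarity and mismatched pairs towards low similarity. This is a generic CLIP-style loss in numpy, not DeepMind's actual implementation; the temperature value and function names are illustrative.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss. Row i of image_emb is assumed
    to be paired with row i of text_emb; all other pairings are negatives."""
    # Normalise embeddings to unit length so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(len(logits))  # correct match is on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimising this loss pulls each image embedding towards its paired caption embedding, which is what makes the similarity score usable for zero-shot classification.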
Source: Research paper, Flamingo model’s architecture
Flamingo owes its strong language-generation ability to the large amount of knowledge stored in the language model's weights. The two pretrained models are interconnected through two learnable architectures, while their own weights are frozen so that their initial capabilities are preserved. First, the vision encoder extracts spatio-temporal features from images or video and passes them to the Perceiver Resampler, which produces a fixed-size set of visual tokens as output. These visual tokens are then used to condition the frozen language model through cross-attention layers interleaved between the pretrained language model's layers. This gives the language model a new way to absorb visual information for the next-token prediction task.
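The key property of the Perceiver Resampler is that a variable number of visual features (however many frames or patches come in) is mapped to a fixed-size set of tokens via cross-attention from a small set of learned latent queries. The sketch below shows only that shape-reducing mechanism, in numpy and without training, multiple heads, or the gating used in the real model; the class name and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class PerceiverResampler:
    """Maps a variable number of visual features to a fixed set of tokens:
    a small set of learned latent queries cross-attends over the features."""
    def __init__(self, dim, num_latents=8, seed=0):
        rng = np.random.default_rng(seed)
        # In the real model these latents are trained; here they are random
        self.latents = rng.normal(size=(num_latents, dim))

    def __call__(self, visual_features):
        # visual_features: (num_patches_or_frames, dim), any length
        scale = np.sqrt(visual_features.shape[1])
        attn = softmax(self.latents @ visual_features.T / scale)
        # Output is always (num_latents, dim), regardless of input length
        return attn @ visual_features
```

Because the output size is constant, the cross-attention layers inserted into the frozen language model always see the same number of visual tokens, whether the input was a single image or a long video.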
To prompt the model on a new task, alternating visual inputs and text responses are provided, followed by one final test video or image. Given this prompt, output text can either be sampled or the probability of each completion in a fixed set can be evaluated. Flamingo's ability to handle interleaved text and visuals makes it a natural fit for in-context few-shot learning, much like GPT-3's few-shot text prompting. DeepMind's recently released large language model, the 70-billion-parameter Chinchilla, was used as the base model for the largest Flamingo model.
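The structure of such a few-shot prompt, support examples interleaved with their answers, then the query with an empty answer slot, can be sketched as below. The image placeholder and separator strings are purely illustrative and are not the special tokens used in the paper.

```python
def build_fewshot_prompt(examples, query_image):
    """Build an interleaved few-shot prompt: each support example pairs a
    visual input with its text answer, and the query image comes last with
    the answer left blank for the model to complete."""
    parts = []
    for image, answer in examples:
        parts.append(f"<image:{image}> Output: {answer}")
    # The final slot has no answer; the model generates (or scores) it
    parts.append(f"<image:{query_image}> Output:")
    return " <EOC> ".join(parts)

prompt = build_fewshot_prompt(
    [("cat.jpg", "A cat on a sofa"), ("dog.jpg", "A dog in a park")],
    "bird.jpg",
)
```

Nothing about the model changes between tasks; only this prompt does, which is what makes the adaptation "in-context" rather than fine-tuning.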
Findings
Three Flamingo models were produced: a 3 billion parameter model built on top of a 1.4 billion parameter frozen language model, a 9 billion parameter model built on a 7 billion parameter frozen language model, and an 80 billion parameter model built on the frozen 70 billion parameter Chinchilla model. The researchers tested Flamingo on 16 tasks, on which it outperformed previous few-shot learning approaches even when given as few as four examples per task.
Source: DeepMind blog, Flamingo engaging in a multimodal conversation and passing the Stroop test
Apart from testing Flamingo on current benchmarks, the study qualitatively examined its performance when captioning images related to gender and skin colour, and ran the generated captions through Google's Perspective API to evaluate their toxicity. The study also presented qualitative examples of interesting interactive abilities, such as "chatting" with the model and asking open-ended questions about input images and videos. Flamingo proved flexible and could potentially serve as a bridge between large language models and visual representations, a step towards general-purpose visual understanding.