In the multimodal contrastive learning framework, an image model is trained jointly with a text model. Recently, prominent models like OpenAI’s CLIP and Google’s ALIGN used this paradigm to do away with the need for task-specific labelled data: they solve new tasks zero-shot by reformulating them as image-text matching problems. But for all its flexibility and its effectiveness on new tasks with little data, contrastive learning has its own limitations, such as the need for very large paired image-text datasets and weaker performance than transfer learning.
Before multimodal learning, transfer learning sped up image classification. Models were first pre-trained on large image datasets, with ImageNet as the standard benchmark, and then transferred via fine-tuning to a new task with less data. Transfer learning worked well for older vision models like Big Transfer (BiT) and Vision Transformer (ViT), but it was relatively slow because every new task required separate fine-tuning on task-specific data. With these disadvantages in mind, Google AI has released a new model called LiT, or Locked-image Text Tuning. It is set to be presented with the paper ‘LiT: Zero-Shot Transfer with Locked-image Text Tuning’ at this year’s CVPR conference, due to be held in June.
How does LiT work?
LiT uses a deceptively simple setup that combines strong image representations from pre-training with the zero-shot flexibility of the contrastive learning framework: a text model learns to match the output of a pre-trained image encoder. This differs from previous multimodal training, where the image encoder learns the image representations from scratch while the text encoder learns the corresponding text representations. As a result, LiT can perform image classification without being trained on every fresh dataset, while retaining accuracy close to that of specialised models.
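The locked-image idea can be illustrated with a minimal NumPy sketch. The random-projection encoders, dimensions, and the simplified squared-error objective below are illustrative assumptions, not the paper’s actual architecture or loss; the point is only that the image tower stays fixed while the text tower learns to match it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_emb = 64, 32, 16

# Hypothetical locked image tower: a fixed random projection stands in
# for the pre-trained encoder (a ViT in the paper). It is never updated.
W_img = rng.normal(size=(d_img, d_emb))
W_img_before = W_img.copy()

def encode_image(x):
    return x @ W_img  # locked representation

# Trainable text tower: the only weights this tuning loop updates.
W_txt = rng.normal(size=(d_txt, d_emb)) * 0.01

def encode_text(t):
    return t @ W_txt

# Toy paired data: each "image" row has a matching "caption" row.
images = rng.normal(size=(256, d_img))
texts = rng.normal(size=(256, d_txt))

# Simplified alignment objective (squared error rather than the full
# contrastive loss): pull each text embedding toward its locked image
# embedding. Gradient descent touches W_txt only.
targets = encode_image(images)
initial_loss = np.mean((encode_text(texts) - targets) ** 2)
for _ in range(200):
    residual = encode_text(texts) - targets
    grad = texts.T @ residual / len(texts)
    W_txt -= 0.1 * grad

final_loss = np.mean((encode_text(texts) - targets) ** 2)
```

After training, the text embeddings have moved toward the fixed image embeddings while the image weights are untouched, which is the essence of the locked-tower setup.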
Models that are not trained with contrastive pre-training learn image embeddings from large, relatively clean datasets of semi-manually labelled images; the most commonly used are ImageNet-21k and JFT-300M. The disadvantage of these datasets is that the model is trained on a restricted set of categories and tends to recognise only those. Multimodal data does not carry this limitation, as the model is trained on free-form text covering a wide range of categories. On the flip side, carefully curated datasets offer higher-quality data than image-text pairs scraped from the web, which are usually noisier.
LiT’s contrastive tuning is initialised with an image model pre-trained on such relatively clean, semi-manually labelled data. The image-text alignment is then learned independently of the image embedding itself.
Contrastive learning on image-text data
The model learns representations from sets of ‘positive’ and ‘negative’ examples, so that the representations of positive pairs are similar to each other while differing from those of negative pairs. The training data consisted of image-text pairs that occur naturally online and are not necessarily clean; this noise makes the model robust, since it must grasp the visual concept rather than memorise a fixed label set. Once training is done, the model can align text and images to solve many problems.
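The positive/negative scheme above can be sketched as a CLIP-style symmetric contrastive loss over a batch of paired embeddings. The function below is an illustrative NumPy version; the function name and the `temperature` value are assumptions, not taken from the paper.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Matching (image, text) pairs sit on the diagonal of the similarity
    matrix and act as positives; every off-diagonal entry is a negative.
    """
    # L2-normalise so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature          # (batch, batch)

    def cross_entropy(l):
        # negative log-softmax of the diagonal (positive) entry per row
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned pairs produce a low loss, while mismatched pairings produce a high one, which is exactly the signal that pulls matching image and text representations together.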
LiT aims for the best of both worlds: transfer learning delivers high accuracy on ImageNet classification, 90.94 per cent, compared with 76.4 per cent for the best contrastive zero-shot models. The key ingredient is that the pre-trained image encoder is ‘locked’, meaning it is not updated during contrastive training.
The LiT model achieved 84.5 per cent zero-shot accuracy on ImageNet classification, a marked improvement that halves the performance gap between fine-tuning and contrastive learning.
The model’s performance was compared with the older state-of-the-art models CLIP and ALIGN on seven VTAB tasks. LiT outperformed CLIP and ALIGN by 8.3 per cent and 8.1 per cent, respectively, on image classification tasks, while CLIP reached 72.3 per cent accuracy on the ObjectNet benchmark.
Advantages of LiT over older vision models
- Being a contrastive model, LiT remains highly accurate on datasets that fool fine-tuned models, such as ObjectNet and ImageNet-C.
- Compared with other contrastive models, LiT needs far less data. An older zero-shot classification model had to be trained on 400 million image-text pairs of private data to match LiT, which is trained on 24 million freely available image-text pairs.
- The locked image encoder is another essential element: it speeds up training and leaves a smaller memory footprint.
- Image representations can be pre-computed for large datasets, allowing training with even larger batches.
- Contrastive tuning works well with image models pre-trained in several other ways, such as self-supervised learning, including many models that are freely available.
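The pre-computation point in the list above can be sketched as follows. The encoder here is a stand-in random projection, and the names and sizes are illustrative assumptions; the idea is simply that a locked tower’s outputs never change, so they can be cached once and reused every epoch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical locked image tower; a fixed random projection stands in
# for a real pre-trained encoder.
W_img = rng.normal(size=(64, 16))

def encode_image(batch):
    return batch @ W_img

images = rng.normal(size=(10_000, 64))

# Because the locked tower's outputs never change, each image embedding
# is computed once and cached, then reused on every epoch. During tuning
# only the text tower runs, which cuts compute and memory and makes very
# large contrastive batches practical.
cached = encode_image(images)  # one forward pass for the whole dataset

for epoch in range(3):
    batch = cached[:256]  # text-tower training would consume this batch
```

Re-running the image encoder every epoch would produce the same numbers at many times the cost, which is why caching is worthwhile only when the tower is locked.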