Can Google’s LiT outperform OpenAI’s CLIP at image classification?

LiT can perform image classification without being retrained on every fresh dataset, while matching the accuracy of specialised models.

Multimodal contrastive learning is a framework in which an image model is trained jointly with a text model. Prominent recent models like OpenAI’s CLIP and Google’s ALIGN work on this paradigm to do away with the need for extra labelled data: they solve new tasks zero-shot by reformulating them as image-text matching problems. But as flexible as contrastive learning is, and as effective as it is on new tasks with less data, it has its own limitations, like the requirement for very large paired image-text datasets and weaker performance than transfer learning.

A conventionally pre-trained model has to be fine-tuned for every new task; a LiT model needs no further training.

Prior to the introduction of multimodal learning, transfer learning was what quickened image classification. Models were first pre-trained on large image datasets, with ImageNet as the benchmark, and then transferred via fine-tuning to a new task with less data. While transfer learning worked well for older vision models like Big Transfer (BiT) and Vision Transformer (ViT), it was slow: every new dataset required separate fine-tuning on task-specific data. Considering these disadvantages, Google AI has released a new model called LiT, or Locked-image Text Tuning. The model is set to be presented along with the paper titled ‘LiT: Zero-Shot Transfer with Locked-image Text Tuning’ at this year’s CVPR conference, due to be held in June.

How does LiT work? 

Multimodal contrastive learning trains models to produce similar representations for closely matched images and texts

LiT uses a deceptively simple setup that leverages strong image representations from pre-training within the zero-shot contrastive learning framework. A text model learns to match its representations to those of a pre-trained image encoder. This differs from the previous approach to multimodal training, where the image encoder learns image representations and the text encoder learns the corresponding text representations simultaneously, both from scratch. With this, LiT can perform image classification without being retrained on every fresh dataset, while matching the accuracy of specialised models.
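The image-text matching idea above can be sketched in a few lines. This is a toy illustration, not LiT’s actual code: the embeddings are made up, whereas in practice they would come from the trained image and text encoders, with class-name prompts (e.g. “a photo of a dog”) fed through the text tower.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarity."""
    return v / np.linalg.norm(v)

def zero_shot_classify(image_emb, class_names, text_embs):
    """Pick the class whose text embedding is most similar to the image embedding."""
    img = normalize(image_emb)
    sims = [float(np.dot(img, normalize(t))) for t in text_embs]
    return class_names[int(np.argmax(sims))]

# Hypothetical embeddings for an image of a dog and two class prompts.
image_emb = np.array([0.9, 0.1, 0.2])
text_embs = [np.array([0.8, 0.2, 0.1]),   # "a photo of a dog"
             np.array([0.1, 0.9, 0.3])]   # "a photo of a cat"

print(zero_shot_classify(image_emb, ["dog", "cat"], text_embs))  # prints: dog
```

Because classification is reduced to nearest-text-embedding lookup, any new set of class names can be handled by embedding new prompts, with no retraining.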

Image and text data pairs in contrastive training

Models that haven’t been trained with contrastive pre-training learn image embeddings from large, relatively clean, semi-manually labelled datasets; the most commonly used are ImageNet-21k and JFT-300M. The disadvantage of these datasets is that the model is trained on a restricted number of categories and tends to recognise only those. Multimodal data does not carry this limitation, as the model is trained on free-form text that covers a wide range of categories. On the flip side, carefully curated datasets tend to be of higher quality than image-text data scraped from the web.

LiT is initialised with an image model pre-trained on relatively clean, semi-manually labelled data. The image-text alignment is then learned independently of the image embedding.

Contrastive learning on image-text data

LiT training uses contrastive learning to teach a text encoder to match a pre-trained image encoder.

The model learns representations from sets of ‘positive’ and ‘negative’ examples, so that the representations of positive pairs are similar to each other and different from those of negative pairs. The training data consisted of noisy image-text pairs occurring naturally online rather than carefully cleaned datasets. This makes the model robust: it has to learn the visual concept itself rather than dataset-specific cues. Once training is done, the model can align text and images to solve many problems.
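The positive/negative objective described above can be sketched as an InfoNCE-style contrastive loss, which is an assumption of this illustration rather than code from the paper: for a batch of matched (image, text) embeddings, the diagonal of the similarity matrix holds the positives, and every other pairing in the batch serves as a negative.

```python
import numpy as np

def contrastive_loss(img_embs, txt_embs, temperature=0.1):
    """InfoNCE-style loss over a batch of matched image/text embedding pairs."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (batch, batch) similarity matrix

    labels = np.arange(len(img))                # positives sit on the diagonal

    def xent(l):
        # Cross-entropy with the matching pair as the target class.
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Symmetric loss: image-to-text and text-to-image directions.
    return (xent(logits) + xent(logits.T)) / 2

# Well-aligned pairs give a low loss; shuffled (mismatched) pairs give a high one.
aligned  = contrastive_loss(np.eye(3), np.eye(3))
shuffled = contrastive_loss(np.eye(3), np.roll(np.eye(3), 1, axis=0))
```

Minimising this loss pulls each image embedding toward its own caption’s embedding and pushes it away from every other caption in the batch.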

LiT aims to combine the best of both worlds: fine-tuned transfer learning achieves 90.94 per cent accuracy on ImageNet classification, while the best contrastive zero-shot models achieve 76.4 per cent. Crucially, the pre-trained image encoder is ‘locked’ so that it is not updated during training.
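The ‘locked’ image tower can be illustrated with a toy training step. Everything here is an assumption for illustration (toy weight matrices, a stand-in gradient computation): the point is only that the image tower’s pre-trained weights are excluded from updates while the text tower learns.

```python
import numpy as np

rng = np.random.default_rng(0)
img_weights = rng.normal(size=(4, 3))   # pre-trained image tower: locked
txt_weights = rng.normal(size=(5, 3))   # text tower: trained from scratch

def train_step(img_weights, txt_weights, lr=0.01):
    """One toy optimisation step; gradients are stand-ins for the real ones."""
    txt_grad = rng.normal(size=txt_weights.shape)
    # Locked-image tuning: the image tower receives no update at all.
    return img_weights, txt_weights - lr * txt_grad

before = img_weights.copy()
img_weights, txt_weights = train_step(img_weights, txt_weights)
# img_weights is bit-for-bit identical to its pre-trained values.
```

In a framework like PyTorch the same effect is achieved by setting `requires_grad=False` on the image encoder’s parameters, which is also what makes training fast and memory-light.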


The LiT model achieved 84.5 per cent zero-shot accuracy on ImageNet classification, a marked improvement that halves the performance gap between fine-tuning and contrastive learning.

Performance of LiT as compared to the best contrastive models and the best models fine-tuned with labels

The model’s performance was compared with that of older state-of-the-art models like CLIP and ALIGN on seven VTAB tasks. LiT outperformed CLIP and ALIGN by 8.3 per cent and 8.1 per cent, respectively, on image classification tasks. At the same time, CLIP reached 72.3 per cent accuracy on the ObjectNet benchmark.

Advantages of LiT over older vision models

  • Being a contrastive model, LiT shows high accuracy on datasets that fool fine-tuned models, such as ObjectNet and ImageNet-C. 
  • Compared with other contrastive models, LiT needs far less data. An older zero-shot model had to be trained on 400 million image-text pairs of private data to match LiT, which is trained on 24 million freely available image-text pairs. 
  • The locked image encoder also makes training quick and leaves a smaller memory footprint. 
  • Image representations can be pre-computed for large datasets, which allows training with even larger batches. 
  • Contrastive training combines well with several other types of pre-training, like self-supervised learning, and with many freely available models. 
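The pre-computation point in the list above follows directly from the locked tower: each image’s embedding never changes, so one forward pass over the dataset can be cached and reused for every epoch. A minimal sketch, assuming a toy linear encoder as a stand-in for the frozen network:

```python
import numpy as np

def image_encoder(x, weights):
    """Stand-in for a frozen pre-trained image network."""
    return x @ weights

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 4))       # locked encoder weights
dataset = rng.normal(size=(1000, 8))    # toy image features

# One forward pass over the whole dataset, done once before training.
cache = image_encoder(dataset, weights)

def get_image_embedding(i):
    # During text-tower training, embeddings come from the cache:
    # no image-encoder forward pass, no image-side activations in memory.
    return cache[i]
```

Since only text-tower activations need to fit on the accelerator, the freed memory can be spent on larger contrastive batches, which supply more negatives per step.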
Poulomi Chatterjee
Poulomi is a Technology Journalist with Analytics India Magazine. Her fascination with tech and eagerness to dive into new areas led her to the dynamic world of AI and data analytics.
