CLIP vs Vision Language Pre-training Vs VisionEncoderDecoder

These seemingly similar models can be confusing to understand to decide which one will be the right choice to apply for a particular setting.

One of the many technological advancements in the field of language models are vision and language variant models. Major tech and research companies have come up with such models, such as OpenAI’s CLIP, Hugging Face’s VisionEncoderDecoder and VLP (Vision Language Pre-training). These seemingly similar models can be confusing to understand to decide which one will be the right choice to apply for a particular setting.

Recently, Prithivi Damodaran, AVP ML R&D at Antworks, who frequently posts about concepts in data science, AI, and NLP, recently posted on LinkedIn on how to understand which one works for you and what to apply in which situation. Let’s try to understand what CLIP, VLP (Vision Language Pre-training) and VisionEncoderDecoder are and what makes each of these unique.

AIM Daily XO

Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

Image: LinkedIn

What is CLIP from OpenAI

Multimodal learning

Released in January last year, Contrastive Language–Image Pre-training, or CLIP, is built on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. OpenAI showed that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a wide range of image classification datasets. This method uses available sources of supervision – the text paired with images found on the internet. The data is used to proxy training tasks for CLIP where given an image, it predicts out of a set of 32,768 randomly sampled text snippets, which was actually paired with it in their dataset.

Download our Mobile App

To do this, OpenAI said, CLIP models, will have to learn to recognise a huge variety of visual concepts in images and associate them with their names. Then, they can be applied to nearly arbitrary visual classification tasks.

Advantages of using CLIP

OpenAI said that it designed CLIP to solve various issues that exist in deep learning methods in computer vision

  • Datasets are costly – As CLIP learns from text–image pairs that are already available publicly, it reduces the need for expensive large labelled datasets.
  • Narrow – CLIP can perform various visual classification tasks without requiring additional training examples. To do this, one has to “tell” CLIP’s text-encoder the names of the task’s visual concepts. Then, it will output a linear classifier of CLIP’s visual representations.

Hugging Face’s VisionEncoderDecoderModel

Multimodal frame

Hugging Face’s VisionEncoderDecoderModel is a generic model class that will be instantiated as a transformer architecture by having one of the base vision model classes of the library as the encoder and another one as the decoder. Hugging Face says that it can be used to initialise an image-to-text-sequence model with any pretrained vision autoencoding model (ViT, BEiT, DeiT) as the encoder and any pretrained language model (RoBERTa, GPT2, BERT) as the decoder.

After such a Vision-Encoder-Text-Decoder model has been trained or fine-tuned, it can be saved/loaded just like any other model.

VLP (Vision Language Pre-training)

Mixed-modal frame

Damodaran says that Unified VLP models are typically pre-trained on a large number of image-text pairs with “creative” self-supervised objectives and loss functions. It can give better vision and language alignment as compared to using vision encoder and language decoder that is trained in isolation.

To understand better what VLP means, in the paper titled “Unified Vision-Language Pre-Training for Image Captioning and VQA”, the authors mention that the model is unified by fine-tuning for either vision-language generation or understanding. It uses a shared multi-layer transformer network for both encoding and decoding instead of methods where encoder and decoder are integrated using separate models. 

The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. 

The team added that VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks (image captioning and visual question answering) across three benchmark datasets of COCO Captions, Flickr30k Captions, and VQA 2.0.

Sign up for The Deep Learning Podcast

by Vijayalakshmi Anandan

The Deep Learning Curve is a technology-based podcast hosted by Vijayalakshmi Anandan - Video Presenter and Podcaster at Analytics India Magazine. This podcast is the narrator's journey of curiosity and discovery in the world of technology.

Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at

Our Upcoming Events

24th Mar, 2023 | Webinar
Women-in-Tech: Are you ready for the Techade

27-28th Apr, 2023 I Bangalore
Data Engineering Summit (DES) 2023

23 Jun, 2023 | Bangalore
MachineCon India 2023 [AI100 Awards]

21 Jul, 2023 | New York
MachineCon USA 2023 [AI100 Awards]

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox

Council Post: Evolution of Data Science: Skillset, Toolset, and Mindset

In my opinion, there will be considerable disorder and disarray in the near future concerning the emerging fields of data and analytics. The proliferation of platforms such as ChatGPT or Bard has generated a lot of buzz. While some users are enthusiastic about the potential benefits of generative AI and its extensive use in business and daily life, others have raised concerns regarding the accuracy, ethics, and related issues.