CLIP vs Vision Language Pre-training vs VisionEncoderDecoder

These seemingly similar models can make it hard to decide which one is the right choice for a particular setting.


Vision-and-language models are among the many recent advances in the field of language models. Major tech and research organisations have come up with such models, including OpenAI’s CLIP, Hugging Face’s VisionEncoderDecoder and VLP (Vision Language Pre-training).

Prithivi Damodaran, AVP ML R&D at Antworks, who frequently posts about concepts in data science, AI, and NLP, recently posted on LinkedIn about how to tell which of these models works for you and which to apply in which situation. Let’s try to understand what CLIP, VLP (Vision Language Pre-training) and VisionEncoderDecoder are and what makes each of them unique.



What is CLIP from OpenAI

Multimodal learning

Released in January 2021, Contrastive Language–Image Pre-training, or CLIP, builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. OpenAI showed that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a wide range of image classification datasets. The method uses a readily available source of supervision: the text paired with images found on the internet. This data is used to create a proxy training task for CLIP: given an image, predict which one out of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.

To do this, OpenAI said, CLIP models have to learn to recognise a huge variety of visual concepts in images and associate them with their names. As a result, they can then be applied to nearly arbitrary visual classification tasks.
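
The proxy task described above can be sketched in plain Python with toy embeddings. This is a minimal illustration, not OpenAI's implementation: it assumes a symmetric cross-entropy over a batch similarity matrix, and the function names and the temperature value of 0.07 are illustrative choices made here.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target_index):
    """Numerically stable softmax cross-entropy for a single row of logits."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(l - largest) for l in logits))
    return log_sum_exp - logits[target_index]


def clip_style_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Contrastive loss for a batch where image k is paired with text k.

    Each image must pick out its own text among all texts in the batch,
    and each text must pick out its own image (the symmetric direction).
    """
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    n = len(images)

    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: the correct "class" for row k is column k.
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n

    # Text-to-image: same idea on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = sum(cross_entropy(columns[k], k) for k in range(n)) / n

    return (image_to_text + text_to_image) / 2
```

With correctly paired embeddings the loss is close to zero; swapping the texts so that each image sits next to the wrong snippet makes it large, which is the signal that pushes matching pairs together during pre-training.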

Advantages of using CLIP

OpenAI said that it designed CLIP to address several problems with standard deep learning methods in computer vision:

  • Datasets are costly – As CLIP learns from text–image pairs that are already available publicly, it reduces the need for expensive large labelled datasets.
  • Models are narrow – A standard vision model is typically good at only the task it was trained on, whereas CLIP can perform various visual classification tasks without requiring additional training examples. To do this, one has to “tell” CLIP’s text encoder the names of the task’s visual concepts. The text encoder then outputs a linear classifier over CLIP’s visual representations.
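
The step of “telling” the text encoder the class names can be sketched as follows. This is a toy illustration under stated assumptions: the embeddings are hand-written two-dimensional vectors standing in for real CLIP encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import math


def unit(vector):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Pick the class whose text embedding is most similar to the image.

    class_text_embeddings maps a prompt (e.g. "a photo of a dog") to the
    embedding a text encoder would produce for it. Those embeddings act as
    the weight rows of a linear classifier over the image embedding.
    """
    image = unit(image_embedding)
    scores = {
        prompt: sum(a * b for a, b in zip(image, unit(embedding)))
        for prompt, embedding in class_text_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Hypothetical embeddings, chosen by hand for illustration only.
toy_classes = {
    "a photo of a dog": [1.0, 0.1],
    "a photo of a cat": [0.1, 1.0],
}
predicted_prompt, similarity_scores = zero_shot_classify([0.9, 0.2], toy_classes)
```

No training examples for "dog" or "cat" are involved: changing the task only means changing the prompts in the dictionary.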

Hugging Face’s VisionEncoderDecoderModel

Multimodal frame

Hugging Face’s VisionEncoderDecoderModel is a generic model class that is instantiated as a transformer architecture, with one of the library’s base vision model classes as the encoder and one of its base language model classes as the decoder. Hugging Face says that it can be used to initialise an image-to-text-sequence model with any pretrained vision autoencoding model (ViT, BEiT, DeiT) as the encoder and any pretrained language model (RoBERTa, GPT2, BERT) as the decoder.

After such a Vision-Encoder-Text-Decoder model has been trained or fine-tuned, it can be saved/loaded just like any other model.
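
A minimal sketch of that workflow with the `transformers` library is shown below. The checkpoint names and the `./vit-bert` directory are example choices, and `build_save_and_reload` is a hypothetical wrapper written for this illustration; calling it requires `transformers` to be installed and downloads the pretrained weights.

```python
# Example checkpoints: a ViT encoder and a BERT decoder from the Hugging Face Hub.
ENCODER_CHECKPOINT = "google/vit-base-patch16-224-in21k"
DECODER_CHECKPOINT = "bert-base-uncased"


def build_save_and_reload(save_dir="./vit-bert"):
    """Combine a pretrained vision encoder with a pretrained text decoder,
    save the combined model to disk, and load it back."""
    # Imported inside the function because transformers is a heavy
    # third-party dependency that is only needed when this is called.
    from transformers import VisionEncoderDecoderModel

    # Pair the two pretrained checkpoints into one image-to-text model.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        ENCODER_CHECKPOINT, DECODER_CHECKPOINT
    )

    # Saving and loading work like for any other transformers model.
    model.save_pretrained(save_dir)
    return VisionEncoderDecoderModel.from_pretrained(save_dir)
```

A model built this way still needs fine-tuning on image-text data before it produces useful captions, because the cross-attention layers that connect the decoder to the encoder are newly initialised.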

VLP (Vision Language Pre-training)

Mixed-modal frame

Damodaran says that unified VLP models are typically pre-trained on a large number of image-text pairs with “creative” self-supervised objectives and loss functions. They can give better vision and language alignment than a vision encoder and a language decoder that are trained in isolation.

The paper titled “Unified Vision-Language Pre-Training for Image Captioning and VQA” helps clarify what VLP means. Its authors describe the model as unified in that it can be fine-tuned for either vision-language generation or understanding. It uses a shared multi-layer transformer network for both encoding and decoding, unlike methods where the encoder and decoder are implemented using separate models.

The unified VLP model is pre-trained on a large number of image-text pairs using two unsupervised learning objectives: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction.
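
One way to picture how a single shared transformer can serve both objectives is through the self-attention mask. The sketch below is an assumption-laden illustration rather than the authors' code: it assumes the input is a sequence of image regions followed by caption words, that the bidirectional objective lets every position attend to every other, and that the seq2seq objective stops each word from attending to later words. `build_attention_mask` is a hypothetical helper.

```python
def build_attention_mask(num_regions, num_words, objective):
    """Return a square 0/1 mask where mask[i][j] == 1 means position i may
    attend to position j. Positions 0..num_regions-1 are image regions and
    the remaining positions are caption words."""
    total = num_regions + num_words

    if objective == "bidirectional":
        # Unrestricted attention between the visual and language inputs.
        return [[1] * total for _ in range(total)]

    if objective == "seq2seq":
        mask = [[0] * total for _ in range(total)]
        for i in range(total):
            for j in range(total):
                if j < num_regions:
                    # Every position may look at all image regions.
                    mask[i][j] = 1
                elif i >= num_regions and j <= i:
                    # A word may look at itself and earlier words only.
                    mask[i][j] = 1
        return mask

    raise ValueError(f"unknown objective: {objective!r}")
```

Under these assumptions the same network weights are used in both cases, and only the mask changes between the understanding-style and the generation-style objective.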

The team added that VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks (image captioning and visual question answering) across three benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0.


Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at
