CLIP vs Vision Language Pre-training vs VisionEncoderDecoder

These seemingly similar models can be hard to tell apart, which makes it difficult to decide which one is the right choice for a particular setting.

Among the many recent advancements in language models are models that combine vision and language. Major tech and research organisations have released such models, including OpenAI’s CLIP, Hugging Face’s VisionEncoderDecoder and VLP (Vision Language Pre-training).

Prithivi Damodaran, AVP ML R&D at Antworks, who frequently posts about concepts in data science, AI, and NLP, recently posted on LinkedIn about how to tell these models apart and which one to apply in which situation. Let’s try to understand what CLIP, VLP (Vision Language Pre-training) and VisionEncoderDecoder are and what makes each of them unique.


What is CLIP from OpenAI

Multimodal learning

Released in January 2021, Contrastive Language–Image Pre-training, or CLIP, builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. OpenAI showed that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a wide range of image classification datasets. The method uses a widely available source of supervision: the text paired with images found on the internet. This data is used to create a proxy training task for CLIP: given an image, predict which of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.
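The proxy task described above can be sketched as a symmetric contrastive loss over a batch of image and text embeddings. This is a minimal NumPy sketch, not OpenAI’s implementation: the function names, the toy batch of 8 pairs (instead of 32,768), the random stand-in embeddings and the fixed temperature of 0.07 are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy where row i of each input is a true image-text pair."""
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    # logits[i, j] = similarity between image i and text j
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])
    # Image -> text: pick the right caption for each image (softmax over columns)
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text -> image: pick the right image for each caption (softmax over rows)
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2


# Toy stand-ins for encoder outputs (hypothetical data, not real CLIP features)
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 16))
aligned_images = text_emb + 0.01 * rng.normal(size=(8, 16))
misaligned_images = np.roll(text_emb, 1, axis=0)  # every image paired with the wrong text

matched_loss = clip_contrastive_loss(aligned_images, text_emb)
mismatched_loss = clip_contrastive_loss(misaligned_images, text_emb)
```

In this toy setup, correctly paired embeddings give a much lower loss than shuffled ones, which is the signal that pushes matching images and texts together during pre-training.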

To do this, OpenAI said, CLIP models have to learn to recognise a huge variety of visual concepts in images and associate them with their names. As a result, they can be applied to nearly arbitrary visual classification tasks.

Advantages of using CLIP

OpenAI said that it designed CLIP to address several problems with standard deep learning approaches to computer vision:

  • Costly datasets – Because CLIP learns from text–image pairs that are already publicly available, it reduces the need for large, expensive labelled datasets.
  • Narrow models – A standard vision model is typically good only at the task it was trained on. CLIP, by contrast, can perform a variety of visual classification tasks without additional training examples. To do this, one has to “tell” CLIP’s text encoder the names of the task’s visual concepts. CLIP then outputs a linear classifier over its visual representations.
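The idea of “telling” the text encoder the class names can be illustrated with a small sketch in which the text embeddings of the class names act as the weights of a linear classifier over image embeddings. The embeddings below are hand-written toy vectors standing in for real encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import numpy as np


def zero_shot_classify(image_emb, class_text_emb, class_names):
    """Assign each image to the class whose text embedding is most similar."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    # The normalised text embeddings play the role of linear classifier weights
    weights = class_text_emb / np.linalg.norm(class_text_emb, axis=-1, keepdims=True)
    logits = image_emb @ weights.T  # shape: (num_images, num_classes)
    return [class_names[i] for i in logits.argmax(axis=-1)]


class_names = ["cat", "dog", "car"]

# Assumed stand-ins for text-encoder outputs of prompts such as "a photo of a cat"
class_text_emb = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Assumed stand-ins for image-encoder outputs of two images
image_emb = np.array([
    [0.9, 0.2, 0.1],
    [0.0, 0.1, 0.8],
])

predictions = zero_shot_classify(image_emb, class_text_emb, class_names)
print(predictions)  # ['cat', 'car']
```

Switching to a different classification task only requires a new list of class names, and therefore new text embeddings; no image model is retrained.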

Hugging Face’s VisionEncoderDecoderModel

Multimodal frame

Hugging Face’s VisionEncoderDecoderModel is a generic model class that is instantiated as a transformer architecture with one of the library’s base vision model classes as the encoder and one of its language model classes as the decoder. Hugging Face says that it can be used to initialise an image-to-text-sequence model with any pretrained vision autoencoding model (ViT, BEiT, DeiT) as the encoder and any pretrained language model (RoBERTa, GPT2, BERT) as the decoder.

After such a Vision-Encoder-Text-Decoder model has been trained or fine-tuned, it can be saved/loaded just like any other model.
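As a rough illustration, the sketch below pairs a ViT encoder with a BERT decoder and then saves and reloads the combined model. It assumes the `transformers` and `torch` packages are installed. The tiny, randomly initialised configurations are an assumption made so that the example runs without downloading anything; in practice one would more likely start from pretrained checkpoints with `VisionEncoderDecoderModel.from_encoder_decoder_pretrained`.

```python
import tempfile

import torch
from transformers import (
    BertConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Deliberately tiny configs with random weights (illustrative values only)
encoder_config = ViTConfig(
    hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
    intermediate_size=64, image_size=32, patch_size=16,
)
decoder_config = BertConfig(
    hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
    intermediate_size=64, vocab_size=100,
)

# Combine the two configs; the decoder is set up with cross-attention over the encoder
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = VisionEncoderDecoderModel(config=config).eval()  # eval() disables dropout

pixel_values = torch.randn(1, 3, 32, 32)        # one random "image"
decoder_input_ids = torch.tensor([[1, 5, 7]])   # three arbitrary token ids

with torch.no_grad():
    logits = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids).logits

# Save and reload just like any other model in the library
with tempfile.TemporaryDirectory() as save_dir:
    model.save_pretrained(save_dir)
    reloaded = VisionEncoderDecoderModel.from_pretrained(save_dir)

with torch.no_grad():
    reloaded_logits = reloaded(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids).logits
```

The decoder produces one row of vocabulary scores per input token, and the reloaded model should reproduce the same scores as the original.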

VLP (Vision Language Pre-training)

Mixed-modal frame

Damodaran says that unified VLP models are typically pre-trained on a large number of image-text pairs with “creative” self-supervised objectives and loss functions. They can give better vision and language alignment than a vision encoder and a language decoder that are trained in isolation.

To better understand what VLP means, consider the paper titled “Unified Vision-Language Pre-Training for Image Captioning and VQA”. The authors describe the model as unified in that it can be fine-tuned for either vision-language generation or understanding. It uses a shared multi-layer transformer network for both encoding and decoding, unlike methods where the encoder and decoder are implemented as separate models.

The unified VLP model is pre-trained on a large number of image-text pairs using two unsupervised learning objectives: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction.
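One way to picture how a single shared transformer can serve both objectives is through the self-attention mask, which controls what context each token may see. The following is a minimal sketch assuming UniLM-style masking, where image regions come first in the sequence and are followed by caption words. The function name and the toy sizes are hypothetical and are not taken from the VLP code.

```python
import numpy as np


def build_attention_mask(num_regions, num_words, mode):
    """Return a boolean matrix where mask[i, j] is True if token i may attend to token j.

    Tokens are ordered as [image regions..., caption words...].
    """
    total = num_regions + num_words
    if mode == "bidirectional":
        # Every token sees every other token, on both sides
        return np.ones((total, total), dtype=bool)
    if mode == "seq2seq":
        mask = np.zeros((total, total), dtype=bool)
        # Every token sees all image regions
        mask[:, :num_regions] = True
        # Each word sees only itself and the words to its left
        mask[num_regions:, num_regions:] = np.tril(np.ones((num_words, num_words), dtype=bool))
        return mask
    raise ValueError(f"unknown mode: {mode}")


bidirectional_mask = build_attention_mask(num_regions=2, num_words=3, mode="bidirectional")
seq2seq_mask = build_attention_mask(num_regions=2, num_words=3, mode="seq2seq")
print(seq2seq_mask.astype(int))
```

Under this assumption the network weights are identical for both objectives; only the mask changes, which is what lets the same model be fine-tuned for understanding tasks (full context) or generation tasks (left-to-right context).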

The authors added that VLP is the first reported model to achieve state-of-the-art results on both vision-language generation and understanding tasks (image captioning and visual question answering) across three benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0.

Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good.
