Last updated February 3, 2022

CLIP vs Vision Language Pre-training vs VisionEncoderDecoder

These models look similar on the surface, which makes it hard to decide which one is the right choice for a particular setting.

Published on January 27, 2022
by Sreejani Bhattacharyya

Among the many recent advances in language models are variants that combine vision and language. Major tech and research organisations have released such models, including OpenAI’s CLIP, Hugging Face’s VisionEncoderDecoder and VLP (Vision Language Pre-training). Because they appear so similar, it can be hard to work out which one suits a given setting.

Prithivi Damodaran, AVP ML R&D at Antworks, who frequently posts about concepts in data science, AI, and NLP, recently posted on LinkedIn about how to tell which of these models works for you and which to apply in which situation. Let’s try to understand what CLIP, VLP (Vision Language Pre-training) and VisionEncoderDecoder are and what makes each of them unique.


What is CLIP from OpenAI

Multimodal learning

Released in January 2021, Contrastive Language–Image Pre-training, or CLIP, builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. OpenAI showed that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a wide range of image classification datasets. The method relies on a readily available source of supervision: the text paired with images found on the internet. This data is used to create a proxy training task for CLIP: given an image, predict which of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.

To solve this task, OpenAI said, CLIP models have to learn to recognise a wide variety of visual concepts in images and associate them with their names. As a result, they can be applied to nearly arbitrary visual classification tasks.
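The proxy task described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than OpenAI’s implementation: the toy two-dimensional embeddings, the helper names and the temperature value are assumptions made for the example, the batch holds 2 pairs instead of 32,768, and the loss is written in the symmetric cross-entropy form described in the CLIP paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by a temperature."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]


def matched_pair_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[k]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropies."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        matched_pair_cross_entropy(logits) + matched_pair_cross_entropy(transposed)
    )


def predict_text_index(image_emb, text_embs):
    """Pick the snippet most similar to the image (the proxy prediction)."""
    logits = similarity_logits([image_emb], text_embs)[0]
    return max(range(len(logits)), key=logits.__getitem__)


# Toy batch with made-up embeddings: image k is paired with text k.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9]]

aligned = contrastive_loss(image_embs, text_embs)
shuffled = contrastive_loss(image_embs, list(reversed(text_embs)))
```

With these toy vectors the loss for correctly paired rows (`aligned`) is lower than the loss when the captions are swapped (`shuffled`), which is the signal that pushes matching images and texts together during training.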

Advantages of using CLIP

OpenAI said that it designed CLIP to address several problems with standard deep learning approaches to computer vision:

Datasets are costly – Because CLIP learns from text–image pairs that are already publicly available, it reduces the need for large, expensive labelled datasets.

Narrow – Standard vision models tend to be good only at the task they were trained for. CLIP can perform a variety of visual classification tasks without additional training examples: one “tells” CLIP’s text encoder the names of the task’s visual concepts, and it outputs a linear classifier over CLIP’s visual representations.
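The idea of the text encoder “outputting a linear classifier” can be illustrated with a small sketch. The embeddings below are hypothetical stand-ins for what CLIP’s text and image encoders would produce, and the function names are invented for this example; the point is only that each class name’s text embedding acts as one weight row of a linear classifier applied to the image embedding.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_zero_shot_classifier(class_text_embs):
    """Each normalised text embedding becomes one weight row of the classifier."""
    return [normalize(v) for v in class_text_embs]


def classify(image_emb, weights, class_names):
    """Score the image against every class row and return the best label."""
    feats = normalize(image_emb)
    scores = [sum(w * x for w, x in zip(row, feats)) for row in weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best], scores


class_names = ["dog", "cat"]
# Hypothetical text-encoder outputs for prompts naming each class.
class_text_embs = [[1.0, 0.2], [0.1, 1.0]]
weights = build_zero_shot_classifier(class_text_embs)

# Hypothetical image-encoder output for a photo that should look dog-like.
label, scores = classify([0.8, 0.3], weights, class_names)
```

Switching to a new task only means supplying a different list of class names and their text embeddings; no image-side training examples are involved in this sketch.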

Hugging Face’s VisionEncoderDecoderModel

Multimodal frame

Hugging Face’s VisionEncoderDecoderModel is a generic model class that is instantiated as a transformer architecture with one of the library’s base vision model classes as the encoder and a language model class as the decoder. Hugging Face says that it can be used to initialise an image-to-text-sequence model with any pretrained vision autoencoding model (such as ViT, BEiT or DeiT) as the encoder and any pretrained language model (such as RoBERTa, GPT2 or BERT) as the decoder.

After such a Vision-Encoder-Text-Decoder model has been trained or fine-tuned, it can be saved/loaded just like any other model.
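As a hedged illustration of this composition, the sketch below builds a VisionEncoderDecoderModel from tiny, randomly initialised ViT and BERT configurations, so it runs without downloading any weights. The layer sizes, vocabulary size and dummy inputs are assumptions chosen only to keep the example small, and the `transformers` and `torch` packages are assumed to be installed. In practice one would typically start from pretrained checkpoints, for instance with `VisionEncoderDecoderModel.from_encoder_decoder_pretrained`.

```python
import tempfile

import torch
from transformers import (
    BertConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny, illustrative sizes (assumptions, not recommended settings).
encoder_config = ViTConfig(
    hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
    intermediate_size=64, image_size=32, patch_size=16,
)
decoder_config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
)

# Combines both configs; the decoder is set up with cross-attention
# so it can attend to the encoder's image features.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
model = VisionEncoderDecoderModel(config=config)
model.eval()  # disable dropout so repeated runs are comparable

# Dummy inputs: one random 32x32 RGB "image" and three decoder token ids.
pixel_values = torch.randn(1, 3, 32, 32)
decoder_input_ids = torch.tensor([[1, 2, 3]])

with torch.no_grad():
    logits = model(
        pixel_values=pixel_values, decoder_input_ids=decoder_input_ids
    ).logits

# Save and reload like any other model in the library.
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir)
    reloaded = VisionEncoderDecoderModel.from_pretrained(tmp_dir)
    reloaded.eval()
    with torch.no_grad():
        reloaded_logits = reloaded(
            pixel_values=pixel_values, decoder_input_ids=decoder_input_ids
        ).logits
```

The logits have one row per decoder token and one column per vocabulary entry. Because the weights here are random, the predictions are meaningless until the model is trained or fine-tuned on image-text pairs.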

VLP (Vision Language Pre-training)

Mixed-modal frame

Damodaran says that unified VLP models are typically pre-trained on a large number of image-text pairs with “creative” self-supervised objectives and loss functions. They can give better vision and language alignment than a vision encoder and a language decoder that were trained in isolation.

To understand better what VLP means, consider the paper titled “Unified Vision-Language Pre-Training for Image Captioning and VQA”. Its authors describe the model as unified in two senses: it can be fine-tuned for either vision-language generation or understanding tasks, and it uses a shared multi-layer transformer network for both encoding and decoding, unlike methods where the encoder and decoder are implemented as separate models.

The unified VLP model is pre-trained on a large number of image-text pairs using two unsupervised learning objectives: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction.

The team added that VLP is the first reported model to achieve state-of-the-art results on both vision-language generation and understanding tasks (image captioning and visual question answering) across three benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0.
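According to the paper, the two objectives differ only in the context each prediction may condition on, which is controlled through self-attention masks on the shared transformer. The sketch below builds such masks in plain Python. The function name, the position layout (image regions first, then caption words) and the omission of special tokens are simplifying assumptions for illustration, following the UniLM-style masking that this line of work builds on, and are not taken from the authors’ code.

```python
def build_attention_mask(num_regions, num_words, objective):
    """Return mask[i][j] = 1 if position i may attend to position j, else 0.

    Positions 0..num_regions-1 are image regions; the remaining
    positions are caption words in left-to-right order.
    """
    size = num_regions + num_words
    if objective == "bidirectional":
        # Every position sees every other position.
        return [[1] * size for _ in range(size)]
    if objective == "seq2seq":
        mask = [[0] * size for _ in range(size)]
        for i in range(size):
            for j in range(size):
                if j < num_regions:
                    mask[i][j] = 1  # all positions see the image regions
                elif i >= num_regions and j <= i:
                    mask[i][j] = 1  # words see themselves and earlier words
        return mask
    raise ValueError(f"unknown objective: {objective!r}")


bidirectional = build_attention_mask(2, 3, "bidirectional")
seq2seq = build_attention_mask(2, 3, "seq2seq")
```

In the seq2seq mask, image regions never attend to caption words and each word is blocked from later words, so the same network behaves like a decoder; in the bidirectional mask nothing is blocked, so it behaves like an encoder.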


Sreejani Bhattacharyya

I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at sreejani.bhattacharyya@analyticsindiamag.com
