How CLIP is changing computer vision as we know it

In January 2021, OpenAI released a zero-shot classifier with widespread implications, called CLIP or Contrastive Language-Image Pre-Training. CLIP broke from the traditional approach of adapting a pre-trained model like ResNet to each new task, which involved collecting huge custom datasets of labelled images. The approach that CLIP took served to improve the generalisability of deep learning models for image classification tasks. While traditional image classifiers treat labels as arbitrary IDs and ignore what the label text means, CLIP is pre-trained on 400 million image-text pairs and creates a text encoding of its classes. To classify an image, a set of classes and their descriptions is defined beforehand, and CLIP predicts which description best matches the image.
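The prediction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional vectors are made-up stand-ins for real encoder outputs, and `zero_shot_classify` and the `scale` value are hypothetical names and settings, not part of OpenAI's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, class_embs, scale=100.0):
    """Return (best_label, probabilities) for one image embedding.

    class_embs maps each class description to its text embedding.
    scale sharpens the softmax; 100.0 is an assumed value for this toy.
    """
    labels = list(class_embs)
    logits = [scale * cosine(image_emb, class_embs[l]) for l in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {l: e / total for l, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs

# Hypothetical embeddings standing in for text and image encoder outputs.
class_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(image_emb, class_embs)
print(label)
```

Because the classes are just text descriptions, changing the task means editing the dictionary keys and re-encoding them rather than retraining a classifier.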

Origins of CLIP

Before CLIP was released, image classification struggled with either insufficient data to feed ML models or the cumbersome process of labelling data. Even with vast annotated datasets like ImageNet, which comprises more than 14 million labelled images, models failed to perform well if the datasets were even slightly tweaked. Large as they were, such datasets proved too small to train models that generalise. There were two ways to tackle this problem: either the models themselves could be improved, or the datasets could be made more diverse. CLIP attempted to revolutionise image classification via the second approach. 

How does CLIP work? 

To connect images with text, CLIP embeds both. A CLIP model consists of two sub-models, called encoders: a text encoder and an image encoder. The text encoder maps text into a vector space, and the image encoder maps images into the same space, so that an image and a caption can be compared directly. CLIP is then trained with contrastive pre-training to predict which text in a batch corresponds to which image: matching pairs are pulled together and mismatched pairs are pushed apart. OpenAI reported that this contrastive objective was about four times more efficient at reaching zero-shot ImageNet accuracy than the other methods it compared against. 
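A contrastive objective of this kind can be sketched as a symmetric cross-entropy over a matrix of image-text similarities. The version below is a toy written in plain Python, assuming that pair i in the image list matches pair i in the text list; the function names and the `temperature` value are illustrative assumptions, and a real model would compute this on batches of learned embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch where image i matches text i."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j] is the temperature-scaled cosine similarity of image i and text j.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Each image should pick out its own text, and each text its own image.
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy([sim[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.1], [0.1, 1.0]])
swapped = contrastive_loss(images, [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)
```

The loss is small when each image sits closest to its own caption and large when the captions are swapped, which is the signal that drives the two encoders towards a shared space.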

Source: OpenAI

Range of use-cases for CLIP

Image generation: OpenAI’s DALL·E, a model that generates images based on text prompts, and its successor DALL·E 2 worked in tandem with CLIP. CLIP was used to score how well the generator’s images matched the prompt. In fact, CLIP has been behind the success of several tools in the growing AI-generated art scene. By scoring how closely an output matches the text, CLIP helps GANs, or Generative Adversarial Networks, move in the right direction. 

Released in 2021, VQGAN+CLIP pairs a Vector Quantized Generative Adversarial Network (VQGAN) with CLIP within the text-to-image paradigm to generate images of variable sizes from a set of text prompts. Unlike VQGAN, however, CLIP isn’t a generative model; it is simply trained to represent both images and text effectively. 

Source: Research Paper

Image classification: CLIP’s ability to work with unseen datasets sometimes makes it much better than models that have been trained on specific datasets. Its advantage tends to be greatest when little labelled data is available for the task. According to a study, CLIP outperformed custom-trained ResNet classification models on a task that involved classifying flowers. 

Source: Oracle blog

Content moderation: Content moderation can be treated as an image classification task. In a new approach, CLIP was shown to be able to locate a prompt in embedding space. The research used cosine similarity to measure how close the CLIP embedding of the prompt was to the CLIP embedding of the drawing. In the case of NSFW images, the CLIP embedding of the image is compared with the CLIP embedding of the text ‘NSFW’ using cosine similarity. If the cosine similarity exceeds a chosen threshold, the image can be classified as NSFW. The same check can then be extended to other categories like racist imagery, hate speech or nudity. 
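The threshold check described above might look like the following sketch. The embeddings are invented three-dimensional vectors, and `is_flagged` and the `threshold` of 0.25 are hypothetical; in practice the cut-off would have to be tuned on real examples.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_flagged(image_emb, concept_emb, threshold=0.25):
    """Flag the image when it is closer to the concept than the threshold allows."""
    return cosine(image_emb, concept_emb) > threshold

# Made-up embeddings: one for the text concept, two for candidate images.
nsfw_concept = [0.0, 1.0, 0.0]
safe_image = [1.0, 0.1, 0.0]
risky_image = [0.2, 0.9, 0.1]

print(is_flagged(safe_image, nsfw_concept), is_flagged(risky_image, nsfw_concept))
```

Adding another moderation category under this scheme only requires another concept embedding and its own threshold.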

CLIP can also be used to make sense of images that are corrupted or distorted. A new research paper titled ‘Inverse Problems Leveraging Pre-trained Contrastive Representations’ demonstrates how a supervised inversion method was used to get effective representations of corrupted images.

Source: Research Paper

Image search: Since CLIP isn’t trained on specific data, it is well suited to searching large catalogues of images. Towards the end of last year, a researcher developed an AI-powered command-line image search tool called rclip. Yurij Mikhalevich, the researcher, wrote in his blog, “OpenAI’s CLIP takes computer image understanding to a whole new level, allowing us to search for photos using any text query we can think of.” 
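Text-to-image search over a catalogue can be pictured as ranking precomputed image embeddings against the embedding of the query. The sketch below is a generic illustration, not a description of how rclip is implemented; the file names, vectors and the `search` function are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_emb, catalogue, top_k=2):
    """Return the top_k file names whose embeddings best match the query."""
    scored = [(cosine(query_emb, emb), name) for name, emb in catalogue.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Made-up image embeddings, computed once and reused for every query.
catalogue = {
    "beach.jpg": [1.0, 0.0, 0.0],
    "forest.jpg": [0.0, 1.0, 0.0],
    "city.jpg": [0.5, 0.5, 0.2],
}
query_emb = [0.9, 0.1, 0.0]  # stand-in for the encoded text query

results = search(query_emb, catalogue)
print(results)
```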

Image captioning: The CLIP prefix captioning repo pairs CLIP with GPT-2 to produce descriptions for images. A CLIP encoding is turned into a prefix for the textual caption by applying a simple MLP over the raw encoding, and the language model is then fine-tuned to produce a usable caption. 
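The mapping step can be pictured as a small MLP that turns one CLIP vector into a short sequence of vectors sized for the language model. The sketch below uses untrained random weights and toy dimensions; the layer sizes, the tanh activation and every function name are assumptions for illustration, not the repo’s actual code.

```python
import math
import random

def linear(x, weights, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def make_layer(rng, n_in, n_out):
    """Create small random weights and a zero bias for one layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def clip_to_prefix(clip_emb, layer1, layer2, prefix_len, lm_dim):
    """Map one CLIP embedding to prefix_len vectors of size lm_dim."""
    hidden = [math.tanh(h) for h in linear(clip_emb, *layer1)]
    flat = linear(hidden, *layer2)
    # Split the flat output into one vector per prefix position.
    return [flat[i * lm_dim:(i + 1) * lm_dim] for i in range(prefix_len)]

rng = random.Random(0)
clip_dim, hidden_dim, prefix_len, lm_dim = 8, 16, 4, 6  # toy sizes
layer1 = make_layer(rng, clip_dim, hidden_dim)
layer2 = make_layer(rng, hidden_dim, prefix_len * lm_dim)
clip_emb = [rng.uniform(-1, 1) for _ in range(clip_dim)]

prefix = clip_to_prefix(clip_emb, layer1, layer2, prefix_len, lm_dim)
print(len(prefix), len(prefix[0]))
```

In a captioning setup, these prefix vectors would be placed ahead of the caption’s token embeddings so that the language model conditions on the image when it generates text.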

Poulomi Chatterjee
Poulomi is a Technology Journalist with Analytics India Magazine. Her fascination with tech and eagerness to dive into new areas led her to the dynamic world of AI and data analytics.
