Early last year, OpenAI released a zero-shot classifier with widespread implications, called CLIP, or Contrastive Language-Image Pre-Training. CLIP broke with the traditional method of fine-tuning a pre-trained model like ResNet, which requires collecting huge custom datasets of labelled images, and in doing so improved the generalisability of deep learning models for image classification tasks. While traditional image classifiers treat labels as bare category indices, CLIP is pre-trained on over 400 million image–text pairs and learns an encoding of natural-language descriptions. A set of classes and descriptions must be defined beforehand, from which CLIP can predict the class of an image.
Origins of CLIP
Before CLIP was released, image classification struggled with either insufficient data to feed ML models or the cumbersome process of labelling data. Despite vast annotated datasets like ImageNet, which comprises 14 million labelled images, models failed to perform well if the datasets were even slightly tweaked. Even such vast datasets proved too small to train models that generalise well. There were two ways to tackle this problem: either the models themselves could be improved, or the datasets could be made more diverse. CLIP attempted to revolutionise image classification via the second approach.
How does CLIP work?
For image and text pairs to be connected to each other, both are embedded. A CLIP model consists of two sub-models, called encoders: a text encoder and an image encoder. The text encoder embeds text into a mathematical space, while the image encoder embeds images into the same space. CLIP is then trained, via contrastive pre-training, to predict which text in a batch corresponds to which image. When tested by OpenAI, CLIP proved four times more efficient than the baseline it was compared against at reaching the same zero-shot ImageNet accuracy.
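The zero-shot step described above can be sketched in a few lines. This is a conceptual illustration only, assuming toy stand-in embeddings rather than real CLIP encoder outputs; the temperature value is likewise illustrative.

```python
import numpy as np

def normalize(x):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Score one image embedding against a set of class-description embeddings.

    Mirrors CLIP's zero-shot setup: cosine similarity between the image and
    each class description, scaled by a temperature and softmax-normalised.
    """
    sims = normalize(text_embs) @ normalize(image_emb)  # cosine similarities
    logits = sims / temperature
    probs = np.exp(logits - logits.max())               # numerically stable softmax
    return probs / probs.sum()

# Toy 4-d embeddings standing in for real encoder outputs
image_emb = np.array([0.9, 0.1, 0.0, 0.1])
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. "a photo of a dog"
    [0.0, 1.0, 0.0, 0.0],   # e.g. "a photo of a cat"
])
probs = zero_shot_classify(image_emb, text_embs)
print(probs.argmax())  # the first description matches best, so prints 0
```

The class whose description embedding lies closest to the image embedding wins; no task-specific training data is needed.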
Range of use-cases for CLIP
Image generation: OpenAI’s DALL·E and its successor DALL·E 2, models that generate images from text prompts, worked in tandem with CLIP: the image classifier was used to evaluate the efficacy of the image generator. In fact, CLIP has been behind the success of several tools in the growing AI-generated art scene, guiding GANs, or Generative Adversarial Networks, towards outputs that match a text prompt.
Released in 2021, CLIP+VQGAN, which pairs CLIP with a Vector Quantized Generative Adversarial Network, is used within the text-to-image paradigm to generate images of variable sizes from a set of text prompts. However, unlike VQGAN, CLIP isn’t a generative model; it is simply trained to represent both images and text effectively.
Image classification: CLIP’s ability to work with unseen datasets can make it better than models that have been trained on specific datasets; the scarcer the task-specific training data, the more this advantage shows. According to one study, CLIP outperformed custom-trained ResNet classification models on a flower-classification task.
Content moderation: Content moderation is a function within image classification. In a new approach, CLIP was shown to be able to locate where a prompt lies in embedding space. The research used cosine similarity to measure the distance between the CLIP embedding of the prompt and the CLIP embedding of the drawing. In the case of NSFW images, CLIP was asked to compare the embedding of the image to the embedding of the text “NSFW” using cosine similarity. If the cosine similarity exceeds a chosen threshold, the image can be classified as NSFW. This function can then be expanded across other categories like racist imagery, hate speech or nudity.
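The thresholding idea above can be sketched as follows. The embeddings and the threshold value here are illustrative stand-ins; a real system would use actual CLIP encodings and tune the threshold per concept on validation data.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flag_image(image_emb, concept_emb, threshold=0.3):
    """Flag an image when its embedding lies close to a concept embedding.

    `threshold` is a hypothetical cut-off; in practice it would be tuned
    per concept ("NSFW", racist imagery, etc.) on validation data.
    """
    return cosine_similarity(image_emb, concept_emb) > threshold

concept_emb = np.array([0.8, 0.6, 0.0])   # stand-in for the embedding of "NSFW"
safe_img    = np.array([-0.6, 0.1, 0.8])  # points away from the concept
risky_img   = np.array([0.7, 0.7, 0.1])   # points towards the concept

print(flag_image(safe_img, concept_emb))   # False
print(flag_image(risky_img, concept_emb))  # True
```

Swapping in a different concept embedding extends the same mechanism to any category of interest without retraining anything.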
CLIP can also be used to pick out images that are corrupt or distorted. A recent research paper, ‘Inverse Problems Leveraging Pre-trained Contrastive Representations’, demonstrates how a supervised inversion method was used to obtain effective representations of corrupt images.
Image search: Since CLIP isn’t trained on specific data, it is well suited to searching large catalogues of images. Towards the end of last year, a researcher developed an AI-powered command-line image search tool called rclip. Yurij Mikhalevich, the researcher, wrote in his blog, “OpenAI’s CLIP takes computer image understanding to a whole new level, allowing us to search for photos using any text query we can think of.”
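At its core, such a search tool ranks catalogue images by the similarity of their embeddings to the embedding of the text query. A minimal sketch, assuming toy vectors in place of real CLIP encodings:

```python
import numpy as np

def search(query_emb, catalogue_embs, top_k=3):
    """Rank catalogue images by cosine similarity to a text-query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    cat = catalogue_embs / np.linalg.norm(catalogue_embs, axis=1, keepdims=True)
    sims = cat @ q                       # cosine similarity per catalogue image
    order = np.argsort(-sims)[:top_k]    # highest similarity first
    return order, sims[order]

# Toy catalogue: each row stands in for a CLIP image embedding
catalogue = np.array([
    [0.1, 0.9, 0.1],   # image 0
    [0.9, 0.1, 0.0],   # image 1 -- closest to the query below
    [0.5, 0.5, 0.5],   # image 2
])
query = np.array([1.0, 0.0, 0.1])  # stand-in for the embedding of a text query
ranking, scores = search(query, catalogue, top_k=2)
print(ranking)  # prints [1 2]
```

In a real tool the catalogue embeddings would be precomputed once per image, so each query costs only one text-encoder pass plus a cheap similarity ranking.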
Image captioning: The CLIP prefix-captioning repo uses GPT-2 to produce descriptions for images. A CLIP encoding is used as a prefix to the textual captions: a simple MLP maps the raw encoding into the language model’s embedding space, and the language model is then fine-tuned to produce a usable caption.
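The mapping step can be sketched as below. The layer sizes, prefix length, and random weights are all illustrative assumptions; the real setup trains this MLP jointly with (or ahead of) the language model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 512-d CLIP embedding, 768-d GPT-2 hidden size, 10 prefix tokens
CLIP_DIM, LM_DIM, PREFIX_LEN = 512, 768, 10

# A simple two-layer MLP mapping one CLIP image embedding to PREFIX_LEN
# pseudo-token embeddings that are fed to the language model as a prefix.
W1 = rng.normal(scale=0.02, size=(CLIP_DIM, 1024))
W2 = rng.normal(scale=0.02, size=(1024, PREFIX_LEN * LM_DIM))

def clip_to_prefix(clip_emb):
    hidden = np.tanh(clip_emb @ W1)   # non-linearity between the two layers
    prefix = hidden @ W2              # project into the LM's embedding space
    return prefix.reshape(PREFIX_LEN, LM_DIM)

clip_emb = rng.normal(size=CLIP_DIM)  # stand-in for a real CLIP image encoding
prefix = clip_to_prefix(clip_emb)
print(prefix.shape)  # (10, 768) -- prepended to the caption tokens during decoding
```

The language model then generates the caption conditioned on these prefix embeddings exactly as it would on ordinary token embeddings.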