OpenAI’s Breakthrough Model Gets ‘CLIP’ped Thanks To Distillation Technique

Using CLIP, OpenAI demonstrated that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance.

OpenAI has garnered a reputation for dishing out state-of-the-art models that can be immediately commercialised through private beta releases via in-house APIs. After the hugely successful GPT-3 release last year, the Microsoft-partnered company released CLIP, a neural network that efficiently learns visual concepts from natural language supervision. To apply CLIP to a visual classification benchmark, one simply provides the names of the visual categories to be recognised, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.

OpenAI’s CLIP and DALL·E models are both products of the ML community’s relentless efforts to combine the strengths of vision and language models, and OpenAI co-founder Ilya Sutskever, too, has stressed their importance going forward. With CLIP, the company tried to address one of the most pressing questions still bothering the community: are these benchmark-smashing, expensive, data-devouring models restricted to big labs? Smaller organisations and individual researchers would like to get their hands on these models, experiment with them, and come up with innovations of their own. Though private beta releases and APIs solve this problem to an extent, the debate continues.

Now, a team at PicCollage has come up with an even more compact version of CLIP. The image-editing app maker recently claimed to have made a lighter version of OpenAI’s famed CLIP model and to run it effectively on iOS. To do this, the team used model distillation to shrink the CLIP model (specifically, its ViT image encoder) and got promising results. “Given the magnitude of the dataset and compute required, it seemed like a daunting task, but we wanted to give it a shot anyway,” wrote the team in their blog post.

A Brief Overview Of CLIP

CLIP’s contrastive pre-training and zero-shot classification pipeline (Source: OpenAI)

CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. Using CLIP, OpenAI demonstrated that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a great variety of image classification datasets.

As illustrated above, CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in OpenAI’s dataset. This behaviour is then used to turn CLIP into a zero-shot classifier: all of a dataset’s classes are converted into captions such as “a photo of a dog”, and CLIP predicts the class whose caption it estimates best pairs with a given image.
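For illustration, here is what that zero-shot procedure looks like with the clip package OpenAI released alongside the model; the image path and candidate classes below are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Convert candidate classes into captions, as described above.
classes = ["dog", "cat", "car"]
text = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    # Scaled cosine similarities between the image and each caption.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))  # highest score = predicted class
```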

Clipping For Edge

According to PicCollage, the objective of the experiment was to use the distillation paradigm to reduce the size of CLIP and explore the possibility of deploying it on edge devices. The original model weighs 350MB, while the distilled model comes in at 48MB with FP32 precision (24MB with FP16). “Distilled models are converted to CoreML format to run on iOS and observed negligible difference between the search results of FP16 and FP32 versions,” wrote Vinay Sisodia, ML Engineer at PicCollage.
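The blog post does not include the conversion step itself, but a Core ML export along these lines, using Apple’s coremltools, would match the description; the `student_encoder` module and the input shape are assumptions, not PicCollage’s actual code:

```python
import torch
import coremltools as ct

student_encoder.eval()  # hypothetical distilled image encoder (a torch.nn.Module)
example = torch.rand(1, 3, 224, 224)  # assumed input resolution
traced = torch.jit.trace(student_encoder, example)

mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="image", shape=example.shape)],
    # FLOAT16 yields the ~24MB variant; ct.precision.FLOAT32 the ~48MB one.
    compute_precision=ct.precision.FLOAT16,
)
mlmodel.save("DistilledCLIP.mlpackage")
```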

Teacher-student framework for distillation (Source: Gou et al.)

Distillation techniques were designed to reduce the complexity of networks by targeting depth and width. Typically, a deeper and wider neural network transfers its knowledge to a shallower and thinner one, as illustrated above in the teacher-student framework. The student network is usually a simplified version of the teacher, with fewer layers and fewer channels in each layer. The smaller version, while preserving the overall structure, operates more efficiently.
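A minimal sketch of one training step in such a teacher-student setup, assuming the student is trained to reproduce the frozen teacher’s embeddings (the article does not specify PicCollage’s exact loss):

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, images, optimizer):
    # Frozen teacher produces the targets; no gradients flow through it.
    teacher.eval()
    with torch.no_grad():
        target = teacher(images)
    # The smaller student tries to reproduce the teacher's embeddings.
    pred = student(images)
    # MSE between embeddings is one common choice; cosine losses also work.
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```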

For their experiments, Sisodia and his team at PicCollage performed model distillation on the ViT (Vision Transformer) model that powers CLIP. In the student model, the width and the number of layers were halved (a rough sketch of such a configuration follows the list below). The team began with a dataset of roughly 200,000 images and gradually grew it to more than 800,000 images; the original CLIP, by contrast, was trained on 400 million image-text pairs. To check the performance of the distilled model, the team ran searches against the COCO test dataset and inspected the top 20 results for each search term. The result was an iPhone-ready CLIP model (<50MB) capable of returning results relevant to the query. Despite the promising results, the team notes that the mini-CLIP still falls short of the original, more powerful model in a few areas:

  • Poor performance on colour-based searches.
  • The original model handles multiple phrasings of a search query better.
  • Queries outside the training set can trick the distilled model.
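As a rough illustration of what halving the width and layers means, compare the two configurations below. The teacher numbers match CLIP’s ViT-B/32 image encoder; the student numbers are illustrative, not PicCollage’s published configuration:

```python
# CLIP's ViT-B/32 image encoder vs. a hypothetical student with the
# width and layer count (and, by assumption, attention heads) halved.
teacher_cfg = {"layers": 12, "width": 768, "heads": 12, "patch_size": 32}
student_cfg = {"layers": 6, "width": 384, "heads": 6, "patch_size": 32}
```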

It would be a stretch to expect to outclass models such as CLIP, trained by a well-funded organisation with ample resources. Even so, tweaks such as model distillation look encouraging for the ML community as a whole.

Know more about distilled CLIP here.


Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.