OpenAI has garnered a reputation for dishing out state-of-the-art models that can be immediately commercialised through private beta releases of its in-house APIs. After the hugely successful GPT-3 release last year, the Microsoft-partnered company released CLIP, a neural network that efficiently learns visual concepts from natural language supervision. All one has to do is provide the names of the visual categories to be recognised, similar to the “zero-shot” capabilities of GPT-2 and GPT-3, and apply the CLIP model to any visual classification benchmark.
Both CLIP and DALL·E are products of the ML community’s relentless efforts to combine the advantages of vision and language models. OpenAI’s co-founder Ilya Sutskever, too, has stressed their importance going forward. With CLIP, the company tried to address one of the most pressing questions that still bothers the community: are these benchmark-smashing, expensive, data-devouring models restricted to big labs? Smaller organisations and individual researchers would like to get their hands on such models, experiment, and come up with innovations of their own. Though private beta releases and APIs solve this problem to an extent, debates around access continue.
Traditionally, vision models have been trained on manually labelled datasets that are expensive to construct and only provide supervision for a limited number of predetermined visual concepts. For instance, the popular ImageNet dataset, according to OpenAI, required over 25,000 workers to annotate 14 million images for 22,000 object categories. CLIP, by contrast, showed that it can learn from text–image pairs already publicly available on the internet, mitigating the need for expensive, large labelled datasets. Even so, the substantial computational complexity and storage requirements of models like ResNet make it a great challenge to deploy them in real-time applications or on edge devices (for example, smartphones).
Now, a team at PicCollage has come up with an even more compact version of CLIP. The image-editing app maker recently claimed to have built a lighter version of OpenAI’s famed CLIP model and even run it effectively on iOS. To do this, the team used model distillation to reduce the size of the CLIP model (its ViT image encoder) and got promising results. “Given the magnitude of the dataset and compute required, it seemed like a daunting task, but we wanted to give it a shot anyway,” wrote the team in their blog post.
A Brief Overview Of CLIP
CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. Using CLIP, OpenAI demonstrated that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a wide variety of image classification datasets.
The CLIP model pre-trains an image encoder and a text encoder to predict which images were paired with which texts in OpenAI’s dataset. This behaviour is then used to turn CLIP into a zero-shot classifier: all of a dataset’s classes are converted into captions such as “a photo of a dog”, and CLIP predicts the class whose caption it estimates best pairs with a given image.
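The zero-shot recipe above can be sketched in a few lines: embed the image and one caption per candidate class, normalise the embeddings, and pick the caption with the highest cosine similarity. In the sketch below, the two encoders are stand-in stubs that return random vectors, not CLIP’s actual networks; only the classification logic is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image):
    # Stand-in for CLIP's image encoder (returns a random embedding here).
    return rng.normal(size=512)

def encode_text(caption):
    # Stand-in for CLIP's text encoder.
    return rng.normal(size=512)

def zero_shot_classify(image, class_names):
    # Build a caption per class, as CLIP's zero-shot setup does.
    captions = [f"a photo of a {c}" for c in class_names]
    img = encode_image(image)
    img = img / np.linalg.norm(img)
    txts = np.stack([encode_text(cap) for cap in captions])
    txts = txts / np.linalg.norm(txts, axis=1, keepdims=True)
    scores = txts @ img  # cosine similarities between image and captions
    return class_names[int(np.argmax(scores))]

print(zero_shot_classify(None, ["dog", "cat", "car"]))
```

With real encoders, the same loop lets CLIP tackle any classification benchmark given only the class names.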
Clipping For Edge
According to PicCollage, the objective of the experiment was to use the distillation paradigm to reduce the size of CLIP and explore the possibility of deploying it on edge devices. The original model weighs in at 350MB, while the distilled model is 48MB at FP32 precision and 24MB at FP16. The distilled models were converted to CoreML format to run on iOS, and the team “observed negligible difference between the search results of FP16 and FP32 versions,” wrote Vinay Sisodia, ML engineer at PicCollage.
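The halving from 48MB to 24MB follows directly from precision: FP16 stores each weight in two bytes instead of FP32’s four. A toy numpy sketch with an arbitrary weight matrix (not the actual distilled model) shows the two-to-one storage ratio:

```python
import numpy as np

# An arbitrary weight matrix standing in for the model's parameters.
weights = np.random.rand(1000, 512).astype(np.float32)

fp32_bytes = weights.nbytes                       # 4 bytes per weight
fp16_bytes = weights.astype(np.float16).nbytes    # 2 bytes per weight

print(fp32_bytes // fp16_bytes)  # → 2: FP16 halves the storage
```

The trade-off is reduced numerical precision, which is why the team checked that FP16 and FP32 search results barely differ before shipping the smaller variant.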
Distillation techniques were designed to reduce the complexity of networks by targeting their depth and width. In the teacher-student framework, a deeper and wider neural network (the teacher) transfers knowledge to a shallower and thinner one (the student). The student network is usually a simplified version of the teacher, with fewer layers and fewer channels in each layer; while preserving the overall structure, it operates more efficiently.
For their experiments, Sisodia and his team at PicCollage performed model distillation on the ViT (Vision Transformer) model that powers CLIP. In the student model, the width and the number of layers were each reduced by a factor of two. The team began with a dataset of ~200,000 images and gradually increased it to more than 800,000 images; the original CLIP, by comparison, was trained on 400 million text–image pairs. To check the performance of the distilled CLIP, the team used the COCO test dataset and inspected the top 20 results for each search term. The result was an iPhone-ready CLIP model (<50MB) capable of returning results relevant to the query. Despite the promising results, the team notes that the mini-CLIP version still falls short of the original, more powerful model in a few areas:
- Colour-based searches perform poorly.
- The original model handles multiple representations of a search query better.
- Queries outside the training set can trick the distilled model.
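A comparison like the team’s top-20 check can be sketched as measuring how much the distilled model’s top-k retrieved images overlap with the original model’s for a given query. The embeddings below are random stand-ins (the student’s are a perturbed copy of the teacher’s), not actual COCO features.

```python
import numpy as np

rng = np.random.default_rng(1)

def topk_indices(query_emb, image_embs, k=20):
    # Rank images by cosine similarity to the query; return the top-k set.
    q = query_emb / np.linalg.norm(query_emb)
    im = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = im @ q
    return set(np.argsort(-scores)[:k].tolist())

# Toy embeddings standing in for the original and distilled models' outputs.
images_teacher = rng.normal(size=(1000, 64))
images_student = images_teacher + 0.1 * rng.normal(size=(1000, 64))
query = rng.normal(size=64)

shared = topk_indices(query, images_teacher) & topk_indices(query, images_student)
overlap = len(shared) / 20  # fraction of top-20 results the two models agree on
print(overlap)
```

A high overlap for common queries, alongside spot checks of the failure cases listed above, is roughly the kind of evidence behind the “relevant to the query” claim.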
It is a bit of a stretch to even think of outclassing models such as CLIP, which was trained by a well-funded organisation with ample resources. That said, tweaks such as model distillation look encouraging for the ML community as a whole.
Know more about distilled CLIP here.