Computer Vision Models That Learn From Language

A typical successful computer vision model is first pre-trained on ImageNet and then applied to tasks such as image classification or captioning. But can vision models learn more from language?

To explore this, two researchers from the University of Michigan introduced “VirTex”, a pre-training approach that learns visual features from language using fewer images. The aim of the work is to demonstrate that natural language can supervise the learning of transferable visual representations with better data efficiency than other approaches.

Captions Over Classification

The current approach to image recognition is to first pre-train a convolutional network to perform image classification on ImageNet. Though this has worked well in practice, the authors of VirTex consider it expensive to scale, since the pre-training step relies on images annotated by human workers. Techniques such as unsupervised pre-training have also proven useful. With this work, however, the authors want to explore alternatives that train models with fewer images.

For example, the popular contrastive methods used for self-supervised learning provide only a semantically sparse learning signal. Image classification limits itself to a single category: it names only the central subject of an image, such as a dog or a cat. Image captioning, by contrast, goes further than even multi-label classification: it not only picks out all the important features but also describes the relations between them.

Consider a picture of a dog and some apples. An image captioning model can come up with something like:

“A brown and white puppy lying on a green lawn looking at apples.”

Captions can mention many objects as well as attributes, relationships, and actions, giving a semantically dense learning signal. Building on the idea that captions provide a rich learning signal, the authors developed VirTex.
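To make "semantically dense" concrete, here is a toy comparison of a single class label with the caption quoted above. The small lexicon is hand-made for this illustration and is purely hypothetical; it is not part of VirTex or of any dataset.

```python
# Hypothetical hand-made lexicon, for illustration only.
LEXICON = {
    "object": {"dog", "puppy", "lawn", "apples"},
    "attribute": {"brown", "white", "green"},
    "action": {"lying", "looking"},
}


def concepts_in(text: str) -> dict:
    """Return the lexicon words found in the text, grouped by kind."""
    words = {w.strip(".,").lower() for w in text.split()}
    return {kind: sorted(words & vocab) for kind, vocab in LEXICON.items()}


def count_concepts(text: str) -> int:
    """Count how many lexicon concepts the annotation mentions."""
    return sum(len(found) for found in concepts_in(text).values())


label = "dog"
caption = "A brown and white puppy lying on a green lawn looking at apples."

print("label:  ", count_concepts(label), concepts_in(label))
print("caption:", count_concepts(caption), concepts_in(caption))
```

Under this toy lexicon the label carries one concept, while the caption carries objects, attributes, and actions for the same image.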

The model consists of a ResNet-50 visual backbone, which extracts image features, and a textual head made up of two unidirectional Transformers, one reading the caption left to right and the other right to left. The training of the VirTex model can be summarised as follows:

  • ResNet-50 extracts image features and the textual head predicts captions via bidirectional language modelling (bicaptioning). 
  • The Transformers perform masked multi-headed self-attention over caption features and multi-headed attention over image features. 
  • After pre-training, the visual backbone is transferred to downstream visual recognition tasks.
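The text side of these steps can be sketched in plain Python. This is a minimal illustration, not the authors' code: the whitespace tokenizer, the `[SOS]` and `[EOS]` boundary tokens, and all function names are assumptions made for the example. It shows how one caption yields a left-to-right and a right-to-left next-token prediction task, along with the causal mask that "masked" self-attention usually refers to.

```python
from typing import List, Tuple

SOS, EOS = "[SOS]", "[EOS]"  # assumed boundary tokens for this sketch


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def next_token_pairs(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Wrap tokens in boundary markers and split into (inputs, targets).

    The targets are the inputs shifted by one position, so the model
    predicts each token from the tokens that came before it.
    """
    seq = [SOS] + tokens + [EOS]
    return seq[:-1], seq[1:]


def bicaptioning_pairs(caption: str):
    """Build training pairs for the forward and the backward text model."""
    tokens = caption.lower().split()  # naive whitespace tokenizer (assumption)
    forward = next_token_pairs(tokens)
    backward = next_token_pairs(tokens[::-1])  # same caption, reversed
    return forward, backward


fwd, bwd = bicaptioning_pairs("A puppy lying on a lawn")
print("forward inputs: ", fwd[0])
print("forward targets:", fwd[1])
print("backward inputs:", bwd[0])
print("mask for 3 tokens:", causal_mask(3))
```

In a full model, each Transformer would also attend over the image features from the backbone, and the two prediction losses would be combined to train the backbone. That part is omitted here.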

The visual backbone is a convolutional network. The authors used a ResNet-50, but say that this part of the architecture can be swapped for any other convolutional network.

The training is performed on the train2017 split of the COCO Captions dataset, which provides 118K images with five captions each.
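For a rough sense of scale, the arithmetic can be written out. The COCO figures come from the text above; the ImageNet-1k training set size of roughly 1.28 million images is an outside figure added here for comparison, and all numbers are rounded.

```python
# Rounded figures. IMAGENET_IMAGES is an outside assumption, not from the article.
COCO_IMAGES = 118_000
CAPTIONS_PER_IMAGE = 5
IMAGENET_IMAGES = 1_280_000

# Total captions available as supervision during pre-training.
total_captions = COCO_IMAGES * CAPTIONS_PER_IMAGE

# How many times more images ImageNet-1k has than COCO train2017.
image_ratio = IMAGENET_IMAGES / COCO_IMAGES

print(f"captions available for pre-training: {total_captions:,}")
print(f"ImageNet-1k has about {image_ratio:.1f}x as many images")
```

Under these rounded numbers, pre-training sees about 590K captions over roughly a tenth as many images as ImageNet-1k.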

Unlike other vision-language approaches such as ViLBERT or VisualBERT, VirTex doesn’t follow the routine of pre-training on ImageNet, fine-tuning, and then treating language as downstream of vision. Instead, VirTex learns visual features directly from language supervision.

Another benefit of textual annotations, the authors wrote, is simplified data collection. To collect classification labels, human experts typically first build an ontology of categories, and then complex crowdsourcing pipelines are used to elicit labels from non-expert users. In contrast, natural language descriptions do not require an explicit ontology and can easily be written by non-expert workers, leading to a simplified data collection pipeline.

Key Takeaways

  • VirTex is a pre-training approach using semantically dense captions to learn visual representations.
  • It can learn high-quality visual representations from fewer images.
  • VirTex yields features that match or exceed those learned on ImageNet – supervised or unsupervised.
  • Natural language descriptions can easily be written by non-expert workers, which simplifies the data collection pipeline.

Link to paper.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
