
Computer Vision Models That Learn From Language


A typical successful computer vision model first undergoes pre-training on ImageNet and then proceeds to tasks such as image classification or captioning. But can vision models learn more from language?

To explore this, two researchers from the University of Michigan introduced VirTex, a pretraining approach that learns visual features via language using fewer images. The aim of the work is to demonstrate that natural language can supervise the learning of transferable visual representations with better data-efficiency than other approaches.

Captions Over Classification

The current approach to image recognition is to first pre-train a convolutional network to perform image classification on ImageNet. Though this has yielded strong results, the authors of VirTex consider it expensive to scale, since the pretraining step relies on images annotated by human workers. Techniques such as unsupervised pre-training have also proven useful, but with this work the authors want to explore alternatives that train models with fewer images.

For example, the popular contrastive methods used for self-supervised learning produce useful features, but their training signal is semantically sparse. Classification, meanwhile, limits itself to a single category: it only names the central subject of an image, a dog or a cat. Image captioning, in contrast to single- or multi-label classification, not only mentions all the important objects but also expresses the relations between them.

Consider a picture of a puppy lying near some apples. A captioning model can come up with something like:

“A brown and white puppy lying on a green lawn looking at apples.”

Captions can mention many objects as well as attributes, relationships, and actions, giving a semantically dense learning signal. Based on this notion that captions provide a rich training signal, the authors developed VirTex.

The model consists of a visual backbone, a ResNet-50, and a textual head made up of two unidirectional Transformers. The training of the VirTex model can be summarised as follows (a code sketch follows the list):

  • ResNet-50 extracts image features, and the textual head predicts caption tokens in both the forward and backward directions (bicaptioning). 
  • The Transformers perform masked multi-headed self-attention over caption features and multi-headed attention over image features. 
  • After pre-training, the visual backbone is transferred to downstream visual recognition tasks.
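
To make this arrangement concrete, here is a minimal PyTorch sketch of a captioning pre-training model in this spirit. All names and hyperparameters are illustrative assumptions, not the authors' released code, and only the forward-direction Transformer of the bicaptioning head is shown (the backward one is symmetric, reading the caption in reverse):

```python
import torch
import torch.nn as nn
import torchvision

class VirTexSketch(nn.Module):
    """Illustrative VirTex-style model: a convolutional visual backbone
    feeding a Transformer textual head that predicts caption tokens.
    Names and hyperparameters are assumptions, not the authors' code."""

    def __init__(self, vocab_size, hidden=512, heads=8, layers=1):
        super().__init__()
        # Visual backbone: ResNet-50 trained from scratch (no ImageNet weights).
        resnet = torchvision.models.resnet50(weights=None)
        # Drop the average-pool and classifier; keep the spatial feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.project = nn.Linear(2048, hidden)  # match the Transformer width

        # Textual head: a causally masked Transformer that self-attends over
        # caption tokens and cross-attends over the projected image features.
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        feats = self.backbone(images)             # (B, 2048, 7, 7) at 224px
        feats = feats.flatten(2).transpose(1, 2)  # (B, 49, 2048)
        feats = self.project(feats)               # (B, 49, hidden)

        tokens = self.embed(captions)             # (B, T, hidden)
        # Causal mask: each position attends only to earlier caption tokens.
        t = captions.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hid = self.decoder(tgt=tokens, memory=feats, tgt_mask=mask)
        return self.out(hid)                      # next-token logits
```

Pre-training then minimises a cross-entropy loss between these logits and the shifted caption tokens; the backward Transformer does the same on the reversed caption.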

The visual backbone is a convolutional network. In this case, a ResNet-50 was used, but the authors say that this visual segment of the architecture can be swapped with any convolutional network.
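
Since only the final spatial feature map is consumed by the textual head, swapping the backbone amounts to swapping the feature extractor. A hypothetical helper, assuming ResNet-style torchvision models:

```python
import torch.nn as nn
import torchvision

def make_backbone(name="resnet50"):
    """Hypothetical helper: build a feature extractor from any ResNet-style
    torchvision model by dropping its average-pool and classifier layers."""
    model = getattr(torchvision.models, name)(weights=None)
    return nn.Sequential(*list(model.children())[:-2])

backbone = make_backbone("resnet101")  # e.g. swap ResNet-50 for ResNet-101
```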

The training is performed on the train2017 split of the COCO Captions dataset, which provides 118K images with five captions each.
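
For reference, this split can be loaded directly with torchvision's CocoCaptions wrapper; the paths below are placeholders for a local download:

```python
import torchvision.datasets as dsets
import torchvision.transforms as T

# Paths are placeholders for a local COCO download; the CocoCaptions
# wrapper also needs the pycocotools package installed.
coco = dsets.CocoCaptions(
    root="coco/train2017",
    annFile="coco/annotations/captions_train2017.json",
    transform=T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()]),
)

image, captions = coco[0]  # `captions` is a list of caption strings
```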

The difference between VirTex and visual-language approaches such as ViLBERT or VisualBERT is that VirTex does not follow the routine of pre-training on ImageNet and then treating language tasks as downstream of vision. On the contrary, VirTex learns visual features directly from language supervision.
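
One way such transfer is commonly evaluated, and a protocol the paper reports results with, is a linear probe: the pretrained backbone is frozen and only a linear classifier is trained on its pooled features. A minimal sketch, assuming a ResNet-50 feature map:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Freeze a pretrained backbone and train only a linear classifier on
    its globally pooled features. Dimensions assume a ResNet-50 backbone
    that outputs a (B, 2048, H, W) feature map."""

    def __init__(self, backbone, num_classes, feat_dim=2048):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False        # backbone stays frozen
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, images):
        with torch.no_grad():
            feats = self.backbone(images).mean(dim=(2, 3))  # global avg pool
        return self.head(feats)            # class logits from frozen features
```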

Another benefit of textual annotations, the authors write, is simplified data collection. To collect classification labels, human experts typically first build an ontology of categories, and then complex crowdsourcing pipelines are used to elicit labels from non-expert users. In contrast, natural language descriptions do not require an explicit ontology and can easily be written by non-expert workers, leading to a simpler data collection pipeline.

Key Takeaways

  • VirTex is a pre-training approach using semantically dense captions to learn visual representations.
  • It can learn high-quality visual representations from fewer images.
  • VirTex yields features that match or exceed those learned on ImageNet, whether supervised or unsupervised.
  • Natural language descriptions can easily be written by non-expert workers, which simplifies the data collection pipeline.

Link to paper.


Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.