Last updated October 17, 2020

Computer Vision Models That Learn From Language

Published on June 20, 2020
by Ram Sagar

Most successful computer vision models are first pre-trained on ImageNet and then applied to tasks such as image classification or captioning. But can vision models learn more from language?

To explore this, two researchers from the University of Michigan introduced “VirTex”, a pretraining approach that learns visual features via language using fewer images. The aim of the work is to demonstrate that natural language can provide supervision for learning transferable visual representations with better data efficiency than other approaches.

Captions Over Classification

Introducing "VirTex": a pretraining approach to learn visual features via language using fewer images.
Pretrain: CNN+Transformer from scratch on COCO Captions.
Transfer CNN: Results on 6 vision tasks match/exceed ImageNet pretraining (10x size wrt COCO)! https://t.co/A3F00jmT9N
— Karan Desai (KD) (@kdexd) June 12, 2020

The current approach to image recognition is to first pre-train a convolutional network to perform image classification on ImageNet. Though this has yielded strong results, the authors of VirTex consider it expensive to scale, since the pretraining step relies on images annotated by human workers. Unsupervised pre-training techniques have also proven useful. With this work, however, the authors want to explore alternatives that train models with fewer images.

For example, the popular contrastive methods used for self-supervised learning provide a learning signal that is semantically sparse. Image classification limits itself to a single category per image and names only the central object, such as a dog or a cat. Image captioning, in contrast to classification or even multi-label classification, not only covers all the important objects but also describes the relationships between them.

The example picture shows a dog and apples. An image captioning model can come up with something like:

“A brown and white puppy lying on a green lawn looking at apples.”

Captions can mention many objects as well as attributes, relationships, and actions, giving a semantically dense learning signal. Building on this notion that captions provide a richer learning signal, the authors developed VirTex.

The model consists of a visual backbone, a ResNet-50 that extracts image features, and a textual head made of two unidirectional Transformers that predict captions. The training of the VirTex model can be summarised as follows:

ResNet-50 extracts image features and the textual head predicts captions via bidirectional language modelling (bicaptioning).
The Transformers perform masked multi-headed self-attention over caption features and multi-headed attention over image features.
After pre-training, the visual backbone is transferred to downstream visual recognition tasks.
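
The bicaptioning step above can be pictured with a small, hedged sketch in plain Python. It only shows how one caption yields a left-to-right and a right-to-left next-token prediction problem, plus the causal mask that masked self-attention relies on. The whitespace tokenisation, the `[SOS]`/`[EOS]` markers and all function names are assumptions made for illustration, not details from the article or the VirTex code.

```python
# Illustrative only: whitespace tokenisation and the [SOS]/[EOS] markers are
# assumptions made for this sketch, not details taken from the article.
SOS, EOS = "[SOS]", "[EOS]"


def captioning_pairs(tokens):
    """Return (inputs, targets) for next-token prediction over `tokens`."""
    inputs = [SOS] + tokens
    targets = tokens + [EOS]
    return inputs, targets


def bicaptioning_pairs(caption):
    """Build forward and backward prediction pairs from one caption."""
    tokens = caption.lower().split()
    forward = captioning_pairs(tokens)          # left-to-right model
    backward = captioning_pairs(tokens[::-1])   # right-to-left model
    return forward, backward


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


forward, backward = bicaptioning_pairs("a puppy looking at apples")
mask = causal_mask(len(forward[0]))

print("forward inputs :", forward[0])
print("forward targets:", forward[1])
print("backward inputs :", backward[0])
print("backward targets:", backward[1])
```

Each Transformer sees only the tokens before the one it is predicting, which is what the mask encodes. The image features would enter through the separate attention over image features, which this sketch leaves out.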

The visual backbone is a convolutional network. In this case, a ResNet-50 was used, but the authors say that this visual segment of the architecture can be swapped with any convolutional network.
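
One way to picture the swappable backbone and the transfer step is the structural sketch below. It is a toy under stated assumptions: `VisualBackbone`, `CaptioningPretrainer`, `row_mean_backbone` and `toy_textual_head` are hypothetical names, and the "networks" are trivial stand-ins rather than a real ResNet-50 or Transformer.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol

Image = List[List[float]]   # toy image: rows of pixel intensities
Features = List[float]


class VisualBackbone(Protocol):
    """Anything that maps an image to a feature vector."""

    def __call__(self, image: Image) -> Features: ...


def row_mean_backbone(image: Image) -> Features:
    """Toy stand-in for a CNN such as ResNet-50: one feature per image row."""
    return [sum(row) / len(row) for row in image]


def toy_textual_head(features: Features) -> str:
    """Toy stand-in for the Transformer head; it generates no real captions."""
    return f"caption from {len(features)} features"


@dataclass
class CaptioningPretrainer:
    backbone: VisualBackbone
    textual_head: Callable[[Features], str]

    def caption(self, image: Image) -> str:
        # Pretraining path: image -> visual features -> caption.
        return self.textual_head(self.backbone(image))

    def transfer(self) -> VisualBackbone:
        # After pretraining, only the visual backbone is kept.
        return self.backbone


model = CaptioningPretrainer(row_mean_backbone, toy_textual_head)
downstream_backbone = model.transfer()
features = downstream_backbone([[1.0, 3.0], [2.0, 2.0]])
```

Because the pretrainer depends only on the `VisualBackbone` interface, any callable with the same shape can replace `row_mean_backbone`, which mirrors the authors' point that the convolutional network can be swapped.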

The training is performed on the train2017 split of the COCO Captions dataset, which provides 118K images with five captions each.
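
A quick back-of-the-envelope check of these numbers, as a hedged sketch: the COCO figures come from the paragraph above, while the ImageNet training-set size is an assumption based on the commonly cited ILSVRC-2012 count and is not stated in this article.

```python
COCO_TRAIN_IMAGES = 118_000         # "118K images", as stated above
CAPTIONS_PER_IMAGE = 5              # as stated above
IMAGENET_TRAIN_IMAGES = 1_281_167   # assumption: commonly cited ILSVRC-2012 train size

caption_pairs = COCO_TRAIN_IMAGES * CAPTIONS_PER_IMAGE
image_ratio = IMAGENET_TRAIN_IMAGES / COCO_TRAIN_IMAGES

print(f"image-caption pairs: {caption_pairs:,}")
print(f"ImageNet / COCO images: {image_ratio:.1f}x")
```

Under these assumptions the pretraining set holds about 590K image-caption pairs, and ImageNet has roughly ten times as many images, in line with the "10x" remark in the tweet quoted earlier.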

The difference between VirTex and other vision-language approaches such as ViLBERT or VisualBERT is that VirTex doesn’t follow the routine of starting from a network pre-trained on ImageNet and then treating language as downstream from vision. Instead, VirTex learns visual features directly from language supervision.

Another benefit of textual annotations, wrote the authors, is simplified data collection. To collect classification labels, human experts typically first build an ontology of categories, and then complex crowdsourcing pipelines are used to elicit labels from non-expert users. In contrast, natural language descriptions do not require an explicit ontology and can easily be written by non-expert workers, leading to a simplified data collection pipeline.

Key Takeaways

VirTex is a pre-training approach using semantically dense captions to learn visual representations.
It can learn high-quality visual representations from fewer images.
VirTex yields features that match or exceed those learned on ImageNet – supervised or unsupervised.
Natural language descriptions can easily be written by non-expert workers, which makes the data collection pipeline simpler.

Link to paper.


Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
