A typical successful computer vision model is first pre-trained on ImageNet and then applied to tasks such as image classification or captioning. But can vision models learn more from language?
To explore this, two researchers from the University of Michigan introduced “VirTex”, a pretraining approach that learns visual features from language using fewer images. The aim of the work is to demonstrate that natural language can provide supervision for learning transferable visual representations with better data efficiency than other approaches.
Captions Over Classification
The current approach to image recognition is to first pre-train a convolutional network to perform image classification on ImageNet. Though this has worked well, the authors of VirTex consider it expensive to scale, since the pretraining step relies on images annotated by human workers. Unsupervised pre-training techniques have also proven useful, but with this work the authors want to explore an alternative: training models with fewer images.
For example, the popular contrastive methods used for self-supervised learning give a learning signal that does not encourage much variety. Semantic classification limits itself to a single category. As illustrated above, image classification names only the central feature of an image, such as dog or cat. Image captioning, by contrast, goes beyond even multi-label classification: it not only detects all the important features but also describes the relations between them.
The picture on the right shows a dog and apples. Image captioning can come up with something like:
“A brown and white puppy lying on a green lawn looking at apples.”
Captions can mention many objects as well as attributes, relationships, and actions, giving a semantically dense learning signal. Building on this idea that captions offer richer supervision, the authors developed VirTex.
The model consists of a ResNet-50 that extracts visual features and a textual head made of two unidirectional Transformers. The training of the VirTex model can be summarised as follows:
- ResNet-50 extracts image features and the textual head predicts captions via bidirectional language modelling (bicaptioning).
- The Transformers perform masked multi-headed self-attention over caption features and multi-headed attention over image features.
- After pre-training, the visual backbone is transferred to downstream visual recognition tasks.
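To make the bicaptioning step more concrete, here is a minimal sketch in plain Python of how one caption could be turned into training pairs for a forward (left-to-right) and a backward (right-to-left) decoder, together with the kind of mask that "masked" self-attention implies. The `[SOS]`/`[EOS]` tokens, the whitespace tokenizer, and all function names are assumptions made for illustration, not details taken from the paper, and the attention over image features is left out.

```python
from typing import List, Tuple

# Hypothetical boundary tokens; the actual vocabulary is not given in the text.
SOS, EOS = "[SOS]", "[EOS]"


def shifted_pair(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Return (decoder inputs, prediction targets) for next-token prediction."""
    padded = [SOS] + tokens + [EOS]
    return padded[:-1], padded[1:]


def bicaptioning_pairs(caption: str):
    """Build training pairs for the forward and the backward decoder."""
    # A whitespace split stands in for a real subword tokenizer (assumption).
    tokens = caption.lower().split()
    forward = shifted_pair(tokens)        # predicts the caption left to right
    backward = shifted_pair(tokens[::-1])  # predicts the caption right to left
    return forward, backward


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


forward, backward = bicaptioning_pairs("A puppy looking at apples")
print("forward inputs :", forward[0])
print("forward targets:", forward[1])
print("backward inputs :", backward[0])
print("backward targets:", backward[1])
print("causal mask for 3 tokens:", causal_mask(3))
```

Each decoder only ever sees the tokens before the one it is predicting, which is what makes it unidirectional; training both directions on the same caption is what gives the textual head its bidirectional flavour.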
The visual backbone is a convolutional network. In this case, a ResNet-50 was used, but the authors say that this visual segment of the architecture can be swapped with any convolutional network.
The training is performed on the train2017 split of the COCO Captions dataset, which provides 118K images with five captions each.
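As a rough illustration of what this training data looks like, the sketch below groups captions by image using a tiny in-memory dictionary laid out like a COCO Captions annotation file (an `images` list and an `annotations` list linked by `image_id`). The file names and captions are invented, the helper name is hypothetical, and how VirTex actually samples captions during training is not covered here.

```python
from collections import defaultdict

# Toy stand-in for a COCO Captions annotation file (file names and captions are invented).
toy_annotations = {
    "images": [
        {"id": 1, "file_name": "puppy.jpg"},
        {"id": 2, "file_name": "kitchen.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "caption": "A puppy lying on a lawn."},
        {"image_id": 1, "caption": "A dog looking at apples."},
        {"image_id": 2, "caption": "A kitchen with a wooden table."},
    ],
}


def captions_by_image(data):
    """Group captions under the file name of the image they describe."""
    names = {image["id"]: image["file_name"] for image in data["images"]}
    grouped = defaultdict(list)
    for ann in data["annotations"]:
        grouped[names[ann["image_id"]]].append(ann["caption"])
    return dict(grouped)


grouped = captions_by_image(toy_annotations)

# Each (image, caption) pair could serve as one example for a captioning objective.
pairs = [(name, cap) for name, caps in grouped.items() for cap in caps]
print(len(pairs), "image-caption pairs in the toy data")

# Rough scale implied by the figures in the text: 118K images with five captions each.
approx_pairs = 118_000 * 5
print(approx_pairs, "pairs at the scale quoted above (approximate)")
```

At the quoted scale this works out to roughly 590K image-caption pairs.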
The difference between VirTex and other vision-language approaches such as ViLBERT or VisualBERT is that VirTex doesn’t follow the usual routine of pre-training on ImageNet and then fine-tuning, with language treated as downstream from vision. Instead, VirTex learns visual features directly from language supervision.
Another benefit of textual annotations, the authors write, is simplified data collection. To collect classification labels, human experts typically first build an ontology of categories, and then complex crowdsourcing pipelines are used to elicit labels from non-expert users. In contrast, natural language descriptions do not require an explicit ontology and can easily be written by non-expert workers, leading to a simplified data collection pipeline.
- VirTex is a pre-training approach using semantically dense captions to learn visual representations.
- It can learn high-quality visual representations from fewer images.
- VirTex yields features that match or exceed those learned on ImageNet – supervised or unsupervised.
- Natural language descriptions can easily be written by non-expert workers, which simplifies the data collection pipeline.