Last updated February 2, 2021
Tech Behind Facebook AI’s Latest Technique To Train Computer Vision Models

Published on January 6, 2021

by Sejuti Das

Image classification may sound simple to humans, but it can be a daunting task for machines. Neural networks based purely on attention have dramatically improved image understanding tasks; however, these visual transformers have typically been pre-trained on massive sets of images using expensive infrastructure, which keeps a large part of the community from adopting them. To address this issue, researchers at Facebook AI have developed a new technique, Data-efficient image Transformers (DeiT), that leverages Transformers but requires far less data to produce an image classification model.

Explaining the research in a recent blog post, Facebook AI stated that the researchers want to show that Transformers can be trained efficiently for image classification using only standard academic data sets. With this, the researchers also aim to extend Transformers to new use cases and advance the field of computer vision.

In a recent tweet, the company also announced that it is open-sourcing the system for researchers and engineers who don’t have access to large-scale systems for training massive AI models.

We’re open-sourcing a new system to train computer vision models using Transformers. Data-efficient image Transformers (DeiT) is a high-performance image classification model requiring less data & computing resources to train than previous AI models. https://t.co/nTJ4bD9Slp pic.twitter.com/OMdB5stBay
— Meta AI (@MetaAI) December 24, 2020

What Are Data-efficient image Transformers?

The Transformer architecture has produced breakthroughs in NLP and machine translation, and Facebook AI researchers have since applied it to tasks such as speech recognition, symbolic mathematics and translation between programming languages. The new technique, Data-efficient image Transformers (DeiT), brings it to image classification while requiring far less data and computing resources than previous models.

For this work, the Facebook AI researchers collaborated with Prof Matthieu Cord from Sorbonne University. They trained the DeiT model on a single 8-GPU server in two to three days. The model achieved 84.2 percent top-1 accuracy on ImageNet, without using any external data for training.

Performance curves comparing DeiT with other visual Transformer models and CNNs.

The researchers also noted that the proposed model produced results competitive with those of the dominant convolutional neural networks (CNNs).

How Does It Work?

Convolution-free transformer models like DeiT have no built-in statistical priors about images, which makes image classification challenging for them. They typically have to “see” a massive set of images in order to learn to classify the different objects in an image.

To avoid this issue, the first critical step was to create a training strategy for DeiT. For this, the researchers adapted existing research on convolutional neural networks, particularly on data augmentation, optimisation and regularisation. As a result, DeiT gave competitive results despite being trained on 1.2 million images rather than hundreds of millions.
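The article does not list which augmentation or regularisation techniques were adapted. As an illustration only, the sketch below shows two techniques that are common in CNN training, mixup (a data augmentation) and label smoothing (a regulariser). Choosing these two, and the function names and parameter values, are assumptions made for this example rather than the researchers' recipe.

```python
import random


def one_hot(label, num_classes):
    """Return a one-hot target vector for an integer class label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def smooth_labels(target, epsilon=0.1):
    """Label smoothing: move a fraction epsilon of the probability mass
    from the given target onto a uniform distribution over all classes."""
    num_classes = len(target)
    return [(1.0 - epsilon) * t + epsilon / num_classes for t in target]


def mixup(image_a, target_a, image_b, target_b, lam):
    """Mixup: blend two (flattened) images and their targets with weight lam."""
    mixed_image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    mixed_target = [lam * a + (1.0 - lam) * b for a, b in zip(target_a, target_b)]
    return mixed_image, mixed_target


# Toy usage: two 4-pixel "images" from a 3-class problem (made-up values).
rng = random.Random(0)
lam = rng.betavariate(0.8, 0.8)  # mixing weight; the 0.8 is an assumed setting
image, target = mixup(
    [0.0, 0.2, 0.4, 0.6], smooth_labels(one_hot(0, 3)),
    [1.0, 0.8, 0.6, 0.4], smooth_labels(one_hot(2, 3)),
    lam,
)
print(image, target)
```

Both functions keep the target a valid probability distribution, so the blended sample can be used with a standard cross-entropy loss.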

Secondly, the researchers modified the Transformer architecture to enable native distillation, in which one neural network (the student) learns from the output of another network that acts as a teacher. They used a CNN as the teacher for the DeiT model; since such an architecture comes with priors about images, it can be trained with fewer images.
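The teacher-student idea can be sketched as a loss function. The version below is a minimal illustration, assuming a "hard" form of distillation in which the student is trained on both the true label and the class the teacher predicts. The article does not specify the loss, so the equal weighting and the function names are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Cross-entropy of the softmax of `logits` against one class index."""
    return -math.log(softmax(logits)[target_index])


def hard_distillation_loss(student_logits, teacher_logits, true_label,
                           teacher_weight=0.5):
    """Combine the loss on the true label with the loss on the teacher's
    predicted class. teacher_weight=0.5 is an assumed setting."""
    teacher_label = max(range(len(teacher_logits)),
                        key=teacher_logits.__getitem__)
    loss_on_label = cross_entropy(student_logits, true_label)
    loss_on_teacher = cross_entropy(student_logits, teacher_label)
    return (1.0 - teacher_weight) * loss_on_label + teacher_weight * loss_on_teacher


# Toy usage: the teacher disagrees with the ground-truth label (made-up logits).
student = [2.0, 0.5, 0.1]
teacher = [0.1, 3.0, 0.2]
print(hard_distillation_loss(student, teacher, true_label=0))
```

When the teacher agrees with the label, the loss reduces to plain cross-entropy; when it disagrees, the student is pulled toward both answers.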

The distillation token interacting with the classification vector and image component tokens through the attention layers.

Further, the student model learns from two different sources, the labelled data set and the teacher, so the two signals can diverge. To handle this, the researchers introduced a distillation token that cues the model for its distillation output, which can differ from its class output. According to the researchers, this method of native distillation was designed specifically for Transformers and improves image classification performance.
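A rough sketch of how such a token might sit alongside the class token is shown below. It is a toy, assuming a single attention layer with identity projections and random values in place of learned parameters. The token order, the two separate output heads and the averaging of their predictions at inference time are all assumptions made for this example.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (a simplification):
    every token, including the two special ones, attends to every token."""
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        mixed.append([sum(w * tok[i] for w, tok in zip(weights, tokens))
                      for i in range(dim)])
    return mixed


def linear(vector, weight_rows):
    """Apply a bias-free linear layer given as a list of weight rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight_rows]


def forward(patch_tokens, class_token, distill_token, class_head, distill_head):
    """Return (class_logits, distill_logits) from two separate heads."""
    tokens = [class_token] + patch_tokens + [distill_token]
    tokens = self_attention(tokens)
    class_logits = linear(tokens[0], class_head)       # trained on true labels
    distill_logits = linear(tokens[-1], distill_head)  # trained on teacher output
    return class_logits, distill_logits


# Toy usage with random stand-ins for learned parameters.
rng = random.Random(0)


def random_vector(size):
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


dim, num_patches, num_classes = 4, 3, 2
patches = [random_vector(dim) for _ in range(num_patches)]
class_token, distill_token = random_vector(dim), random_vector(dim)
class_head = [random_vector(dim) for _ in range(num_classes)]
distill_head = [random_vector(dim) for _ in range(num_classes)]

class_logits, distill_logits = forward(
    patches, class_token, distill_token, class_head, distill_head)

# At inference, one option is to average the two heads' probabilities.
fused = [(a + b) / 2.0
         for a, b in zip(softmax(class_logits), softmax(distill_logits))]
print(fused)
```

Because each head has its own token and its own target, the label signal and the teacher signal do not have to compete for the same output.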

Wrapping Up

With this, DeiT looks set to advance the use of Transformers in computer vision. The experiments showed that the proposed model's results are competitive with those of convolutional neural networks, which have dominated the field of CV for almost a decade. A convolution-free transformer model like DeiT could therefore help democratise artificial intelligence by making it possible for developers to train AI models with less data.
Read the paper here.
