Tech Behind Facebook AI’s Latest Technique To Train Computer Vision Models


Image classification may seem effortless to humans, but it is a daunting task for machines. Neural networks based purely on attention have dramatically improved image understanding; however, these visual Transformers have typically been pre-trained on massive sets of images using expensive infrastructure, which keeps much of the community from adopting them. To address this, researchers at Facebook AI have developed a new technique, Data-efficient image Transformers (DeiT), that leverages Transformers but requires far less data to produce an image classification model.

Explaining the research in a recent blog post, Facebook AI stated that the researchers want to show that Transformers can be trained efficiently for image classification using only standard academic data sets. With this, the researchers also aim to extend Transformers to new use cases and advance the field of computer vision.


In a recent tweet, the company also announced that it is open-sourcing the system for researchers and engineers who don’t have access to large-scale systems for training massive AI models.


What Are Data-efficient image Transformers?

The Transformer architecture has produced breakthroughs in NLP and machine translation, and Facebook AI researchers have since applied it to tasks such as speech recognition, symbolic maths and translation between programming languages. The new technique, Data-efficient image Transformers (DeiT), extends this to image classification while requiring far less data and computing resources than earlier visual Transformers.

For this work, the Facebook AI researchers collaborated with Prof Matthieu Cord from Sorbonne University. They trained the DeiT model on a single 8-GPU server in two to three days. The model reached 84.2 percent top-1 accuracy on ImageNet without using any external data for training.

The performance curve of the comparison between DeiT, visual Transformer models and CNNs.

The researchers even noted that the proposed model produced competitive results compared to the dominant convolutional neural networks.


How Does It Work?

Convolution-free Transformer models like DeiT have no built-in statistical priors about images, which makes image classification challenging for them. Such models typically have to “see” a massive set of images in order to learn to classify different objects.

To get around this, the first critical step was to create a training strategy for DeiT. For this, the researchers adapted existing research on convolutional neural networks, particularly on data augmentation, optimisation and regularisation. As a result, DeiT gave competitive results despite being trained on 1.2 million images rather than hundreds of millions.

Secondly, the researchers modified the Transformer architecture to enable native distillation, in which one neural network (the student) learns from the output of another network (the teacher). They used a CNN as the teacher for the DeiT model; since a CNN's architecture comes with priors about images, it can be trained with fewer images.
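The teacher-student idea can be illustrated with a minimal sketch of a generic soft distillation loss. This is not the DeiT code: the function names, the weighting `alpha` and the `temperature` value are assumptions chosen for illustration, and plain Python lists stand in for network outputs. The student is penalised both for missing the true label and for disagreeing with the teacher's softened predictions.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability that the model assigns to the given label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def soft_distillation_loss(student_logits, teacher_logits, label,
                           alpha=0.5, temperature=3.0):
    """Blend the label loss with a term that pulls the student toward the teacher.

    alpha and temperature are illustrative values, not taken from the article.
    """
    label_loss = cross_entropy(student_logits, label)
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    # Scaling by temperature squared is the usual convention for this term.
    teacher_loss = kl_divergence(teacher_probs, student_probs) * temperature ** 2
    return (1 - alpha) * label_loss + alpha * teacher_loss


if __name__ == "__main__":
    teacher = [2.0, 0.0, -1.0]
    print(soft_distillation_loss([2.0, 0.0, -1.0], teacher, label=0))
    print(soft_distillation_loss([-1.0, 0.0, 2.0], teacher, label=0))
```

A student whose scores match the teacher's incurs no teacher term at all, so the second call prints a larger loss than the first.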

The distillation token interacting with the classification vector and image component tokens through the attention layers.

Further, the student model learns from two different sources, the labelled data set and the teacher, so there was a possibility of the two diverging. To handle this, the researchers introduced a distillation token, a learned vector that passes through the network alongside the image data and cues the model for its distillation output, which can differ from its class output. According to the researchers, this method of native distillation was designed specifically for Transformers and improves image classification performance.
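One way to picture the two outputs is as two prediction heads with separate targets. The sketch below is a simplified illustration rather than the released DeiT code: it assumes a "hard" variant in which the distillation output is trained on the teacher's top prediction, weights the two terms equally, and averages the two heads' probabilities at prediction time. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the given label."""
    return -math.log(softmax(logits)[label])


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def hard_distillation_loss(class_logits, distill_logits, true_label, teacher_logits):
    """Class output follows the data label; distillation output follows the teacher.

    The equal 0.5 / 0.5 weighting is an assumption made for this sketch.
    """
    teacher_label = argmax(teacher_logits)
    class_loss = cross_entropy(class_logits, true_label)
    distill_loss = cross_entropy(distill_logits, teacher_label)
    return 0.5 * class_loss + 0.5 * distill_loss


def predict(class_logits, distill_logits):
    """Combine both outputs by averaging their probabilities (assumed fusion rule)."""
    class_probs = softmax(class_logits)
    distill_probs = softmax(distill_logits)
    fused = [(c + d) / 2 for c, d in zip(class_probs, distill_probs)]
    return argmax(fused)


if __name__ == "__main__":
    teacher_logits = [0.1, 3.0, 0.2]  # the teacher prefers class 1
    true_label = 0                    # the data set says class 0
    print(hard_distillation_loss([4.0, 0.0, 0.0], [0.0, 4.0, 0.0],
                                 true_label, teacher_logits))
    print(predict([4.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Because each output has its own target, a disagreement between the label and the teacher does not force a single output to satisfy both at once.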

Wrapping Up

With this, DeiT looks set to advance the use of Transformers in computer vision. The experiments showed that the proposed model is competitive with convolutional neural networks, which have dominated the field of CV for almost a decade. A convolution-free Transformer model like DeiT could therefore help democratise artificial intelligence by making it possible for developers to train AI models with less data.
Read the paper here.


Sejuti Das
Sejuti currently works as Associate Editor at Analytics India Magazine (AIM). Reach out at sejuti.das@analyticsindiamag.com
