Tech Behind Facebook AI’s Latest Technique To Train Computer Vision Models

While image classification sounds simple to human ears, it can be a daunting task for machines. Neural networks based purely on attention have dramatically improved image understanding tasks; however, these visual transformers have typically been pre-trained on massive sets of images using expensive infrastructure, which keeps a large part of the community from adopting them. To address this issue, researchers at Facebook AI have developed a new technique — Data-efficient image Transformers (DeiT) — that leverages Transformers but requires far less data to produce an image classification model.

Explaining the research in a recent blog post, Facebook AI stated that the researchers wanted to show that Transformers can be trained efficiently for image classification using only standard academic data sets. With this, the researchers also aim to extend Transformers to new use cases and advance the field of computer vision.

In a recent tweet, the company announced it was open-sourcing the system for researchers and engineers who don’t have access to large-scale infrastructure for training massive AI models.

What is Data-efficient image Transformers?

Since the Transformer architecture has produced breakthroughs in NLP and machine translation, Facebook AI researchers had already deployed it for tasks such as speech recognition, symbolic maths and translation of programming languages. The new technique, Data-efficient image Transformers (DeiT), goes further: it requires far less data and computing resources to deliver results on image classification tasks.

To facilitate this, the researchers at Facebook AI collaborated with Prof Matthieu Cord of Sorbonne University. They trained the DeiT model on a single 8-GPU server in two to three days, achieving 84.2% top-1 accuracy on ImageNet without using any external data for training.

Figure: performance curves comparing DeiT with visual Transformer models and CNNs.

The researchers noted that the proposed model produced competitive results compared to the dominant convolutional neural networks.

How Does It Work?

With no statistical priors about images built in, image classification is challenging for convolution-free transformer models like DeiT. They have to “see” a massive set of images in order to learn to classify the different objects in them.
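To make the “convolution-free” point concrete, here is a minimal sketch of how a vision transformer turns an image into a sequence of tokens: the image is cut into fixed-size patches, and each flattened patch is linearly projected into an embedding. This is a generic illustration, not Facebook’s code; the function name and sizes are assumptions.

```python
import torch

def image_to_patch_tokens(images, patch_size=16, embed_dim=768):
    """Split a batch of images into non-overlapping patches and embed them."""
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # (B, C, H, W) -> (B, C, H/p, W/p, p, p): carve out p x p patches.
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # Flatten each patch to a vector: (B, num_patches, C * p * p).
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)
    # Linear projection into the transformer's embedding space.
    projection = torch.nn.Linear(c * patch_size * patch_size, embed_dim)
    return projection(patches)  # (B, num_patches, embed_dim)

tokens = image_to_patch_tokens(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

Note that, unlike a CNN, nothing here encodes locality or translation invariance; the attention layers must learn such regularities from data, which is why the training strategy matters so much.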

To address this, the first critical step was to create a training strategy for DeiT. The researchers adapted existing research on convolutional neural networks, particularly on data augmentation, optimisation and regularisation. As a result, DeiT gave competitive results despite being trained on 1.2 million images rather than the hundreds of millions typically required.
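One augmentation commonly used in such heavily regularised training recipes is Mixup, which blends pairs of images and their labels. The sketch below is a generic Mixup implementation for illustration, not the actual DeiT training code, and the function name and default `alpha` are assumptions.

```python
import torch

def mixup(images, labels, num_classes, alpha=0.8):
    """Blend a batch with a shuffled copy of itself; labels are blended too."""
    # Mixing coefficient drawn from a Beta distribution.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    # Soft targets: convex combination of the two one-hot labels.
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1 - lam) * one_hot[perm]
    return mixed_images, mixed_labels
```

Each training image the network sees is thus a novel combination, which stretches a 1.2-million-image data set much further than plain sampling would.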

Secondly, the researchers modified the Transformer architecture to enable native distillation, in which one neural network (the student) learns from the output of another network (the teacher). They used a CNN as the teacher for the DeiT model; since such an architecture encodes priors about images, the student could be trained with fewer images.

Figure: the distillation token interacting with the classification vector and image component tokens through the attention layers.

Further, the student model learns from two different sources, the labelled data set and the teacher, so the two signals can diverge. To let the model learn from the teacher alongside the labels, the researchers introduced a distillation token that cues the model for its distillation output. According to the researchers, this method of native distillation was designed specifically for Transformers to improve image classification performance.
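The two-source objective above can be sketched as a hard-label distillation loss, assuming the student emits separate logits from its class token and its distillation token: the class-token head is trained on the ground-truth label, while the distillation-token head is trained on the teacher’s predicted label. This is an illustrative sketch of the idea, not Facebook’s released code.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(class_logits, distill_logits, teacher_logits, labels):
    """Average of the ground-truth loss and the teacher-imitation loss."""
    # Class-token head learns from the labelled data set.
    loss_cls = F.cross_entropy(class_logits, labels)
    # Distillation-token head learns from the CNN teacher's hard decision.
    teacher_labels = teacher_logits.argmax(dim=1)
    loss_distill = F.cross_entropy(distill_logits, teacher_labels)
    return 0.5 * loss_cls + 0.5 * loss_distill
```

Because each head has its own token, the two objectives do not compete for a single output vector, which is the role the distillation token plays in the architecture.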

Wrapping Up

With this, DeiT can be expected to advance computer vision using Transformers. The experiments showed that the proposed model’s results are competitive with those of convolutional neural networks, which have dominated the field of computer vision for almost a decade. A convolution-free transformer model like DeiT could therefore help democratise artificial intelligence by making it possible for developers to train AI models with less data.
Read the paper here.

Sejuti Das

Sejuti currently works as Associate Editor at Analytics India Magazine (AIM). Reach out at sejuti.das@analyticsindiamag.com