While image classification sounds like a simple process to human ears, it can be a daunting task for machines. Neural networks based purely on attention have dramatically improved image understanding; however, these visual Transformers have typically been pre-trained on massive sets of images using expensive infrastructure, which keeps a large part of the community from adopting them. To address this issue, researchers at Facebook AI have developed a new technique, Data-efficient image Transformers (DeiT), that leverages Transformers but requires far less data to produce an image classification model.
Explaining the research in a recent blog post, Facebook AI stated that the researchers want to show that Transformers can be trained efficiently for image classification using only standard academic data sets. With this, the researchers also aim to extend Transformers to new use cases and advance the field of computer vision.
In a recent tweet, the company also announced that it is open-sourcing the system for researchers and engineers who don’t have access to large-scale systems for training massive AI models.
What are Data-efficient image Transformers?
Since the Transformer architecture has produced breakthroughs in NLP and machine translation, Facebook AI researchers have applied it to tasks such as speech recognition, symbolic maths and translation of programming languages. The new technique, Data-efficient image Transformers (DeiT), brings the same architecture to image classification while requiring far less data and computing resources.
To facilitate this, the researchers of Facebook AI collaborated with Prof Matthieu Cord from Sorbonne University. They trained the DeiT model on a single 8-GPU server in two to three days. The model reached 84.2 per cent top-1 accuracy on ImageNet, without using any external data for training.
The performance curve of the comparison between DeiT, visual Transformer models and CNNs.
The researchers even noted that the proposed model produced competitive results compared to the dominant convolutional neural networks.
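For readers unfamiliar with the metric quoted above, top-1 accuracy is the share of images for which the model's single highest-scoring class matches the true label. The snippet below is a minimal sketch of that calculation; the function name and the toy scores and labels are made up for illustration and are not taken from the DeiT code or results.

```python
def top1_accuracy(scores, labels):
    """Fraction of examples whose highest-scoring class equals the true label."""
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the model's top-1 prediction.
        predicted = max(range(len(row)), key=lambda i: row[i])
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Toy data (illustrative only): 4 images, 3 classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
labels = [1, 0, 2, 0]

accuracy = top1_accuracy(scores, labels)
print(accuracy)  # 3 of the 4 predictions match, so this prints 0.75
```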
How Does It Work?
With no statistical priors about images, convolution-free Transformer models like DeiT find image classification challenging. They typically have to “see” a massive set of images before they can learn to classify the different objects in them.
To avoid this issue, the first critical step was to create a training strategy for DeiT. For this, the researchers adapted existing research on convolutional neural networks, particularly for data augmentation, optimisation and regularisation. As a result, DeiT gave competitive results despite being trained on 1.2 million images rather than hundreds of millions of them.
Secondly, the researchers modified the Transformer architecture to enable native distillation, in which one neural network (the student) learns from the output of another network that acts as a teacher. They used a CNN as the teacher for the DeiT model; since such an architecture comes with priors about images, it can be trained with fewer images.
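The teacher-student idea can be sketched as a loss function. The example below is a minimal illustration, assuming a "hard-label" form of distillation in which the student is penalised both for missing the true label and for missing the teacher's top prediction; the equal 0.5 weighting, the function names and the toy logits are assumptions made for illustration, not details stated in this article.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability the model assigns to the target class."""
    return -math.log(softmax(logits)[target])


def hard_distillation_loss(student_logits, teacher_logits, true_label):
    """Average of the loss against the true label and the teacher's top class.

    Assumption: equal weighting of the two terms, chosen here for illustration.
    """
    teacher_label = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    loss_label = cross_entropy(student_logits, true_label)
    loss_teacher = cross_entropy(student_logits, teacher_label)
    return 0.5 * loss_label + 0.5 * loss_teacher


# Toy logits (illustrative only) for a 3-class problem.
student = [2.0, 0.5, 0.1]
teacher_agrees = [3.0, 1.0, 0.0]     # teacher's top class is 0, same as the label
teacher_disagrees = [0.2, 2.5, 0.1]  # teacher's top class is 1

print(hard_distillation_loss(student, teacher_agrees, true_label=0))
print(hard_distillation_loss(student, teacher_disagrees, true_label=0))
```

When the teacher agrees with the label, the loss reduces to ordinary cross-entropy; when it disagrees, the student is pulled towards both targets at once.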
The distillation token interacting with the classification vector and image component tokens through the attention layers.
Further, the student model was learning from two different sources, the labelled data set and the teacher; thus, there was a possibility of the two diverging. To handle this, the researchers introduced a distillation token, a learned vector that flows through the network along with the image data and cues the model for its distillation output, which can differ from its class output. According to the researchers, this method of native distillation was designed specifically for Transformers to improve image classification performance.
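The role of the distillation token can be illustrated with a toy forward pass. The sketch below is a heavy simplification under stated assumptions: a single self-attention layer with identity projections stands in for the full Transformer, the token order (class token, image patches, distillation token) is assumed, and every name and number is hypothetical rather than taken from the DeiT code.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (a simplification).

    Every token attends to every other token, so the class and distillation
    tokens both mix in information from the image patches and from each other.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in tokens:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in tokens])
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def linear(vec, weight):
    """Apply a linear head: one output per row of the weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def forward(patch_tokens, class_token, distill_token, class_head, distill_head):
    # Assumed layout: [class token] + image patch tokens + [distillation token].
    tokens = [class_token] + patch_tokens + [distill_token]
    tokens = self_attention(tokens)
    class_logits = linear(tokens[0], class_head)       # read from the class token
    distill_logits = linear(tokens[-1], distill_head)  # read from the distillation token
    return class_logits, distill_logits


# Toy values (illustrative only): 3 patches, embedding size 4, 2 classes.
patch_tokens = [[1.0, 0.0, 0.5, 0.2], [0.3, 0.8, 0.1, 0.0], [0.0, 0.2, 0.9, 0.4]]
class_token = [0.1, 0.1, 0.1, 0.1]
distill_token = [-0.1, 0.2, 0.0, 0.3]
class_head = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
distill_head = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

class_logits, distill_logits = forward(
    patch_tokens, class_token, distill_token, class_head, distill_head
)
print(class_logits, distill_logits)
```

In a training loop built on this sketch, the class output would be compared with the ground-truth label and the distillation output with the teacher's prediction, which keeps the two learning signals on separate outputs.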
With this, DeiT looks set to advance computer vision using Transformers. The experiments showed that the proposed model's results are competitive with those of convolutional neural networks, which have dominated the field of CV for almost a decade. A convolution-free Transformer model like DeiT could therefore help democratise artificial intelligence, making it possible for developers to train AI models with less data.
Read the paper here.