In computer vision applications, attention is either applied alongside CNNs or used to replace certain components of these convolutional networks while keeping their overall structure in place. Convolutional architectures, however, remain dominant.
Last week, a paper under double-blind review for ICLR 2021 created a buzz in the ML community. The paper, titled ‘An Image is Worth 16x16 Words’, was discussed by the likes of Tesla’s AI head, Andrej Karpathy, among many others.
Ever since the seminal paper “Attention Is All You Need,” transformers have rekindled the interest in language models. While the transformer architecture has become the go-to solution for many natural language processing tasks, its applications to computer vision remain limited.
Transformers are slowly becoming popular with tasks as diverse as speech recognition, symbolic mathematics, and even reinforcement learning.
In the latest work under review at ICLR 2021, the anonymous authors claim that their results show the vision transformer can go toe to toe with state-of-the-art models on image recognition benchmarks, reaching accuracies as high as 88.36% on ImageNet and 94.55% on CIFAR-100.
Overview Of Vision Transformers
The vision transformer, as illustrated above, receives input as a one-dimensional sequence of token embeddings. To handle 2D images, the image is reshaped into a sequence of flattened 2D patches. The authors state that the transformer uses a constant width through all of its layers, so a trainable linear projection maps each vectorised patch to the model dimension; the outputs of this projection are the patch embeddings.
These patch embeddings are combined with position embeddings to retain positional information. The joint embeddings are fed as input to the encoder. The authors also propose a hybrid architecture where, instead of dividing the raw image into patches, the input sequence is formed from the intermediate feature maps of a ResNet model.
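One way to picture the hybrid variant (a rough NumPy sketch, not the authors' code; the feature-map and model dimensions here are illustrative assumptions): each spatial position of a ResNet feature map becomes one input token, projected to the model dimension.

```python
import numpy as np

# Hypothetical intermediate ResNet feature map of shape (H', W', C).
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((14, 14, 1024))

Hp, Wp, C = feature_map.shape
D = 768  # assumed transformer model dimension

# Flatten the spatial grid into a token sequence and project C -> D
# (equivalent to a trainable 1x1 convolution over the feature map).
W_proj = rng.standard_normal((C, D)) * 0.02
tokens = feature_map.reshape(Hp * Wp, C) @ W_proj

print(tokens.shape)  # (196, 768): 14*14 spatial positions, each a D-dim token
```

With this input, patch extraction is effectively delegated to the CNN, and the transformer encoder consumes the resulting sequence exactly as it would consume patch embeddings.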
- Image is split into fixed-size patches.
- Patches are linearly embedded.
- Position embeddings are added to the resulting sequence of vectors.
- Patches are fed to a standard transformer encoder.
- Extra learnable “classification token” is added to the sequence to perform classification.
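The steps above can be sketched in NumPy (a minimal illustration, assuming the paper's ViT-Base settings of 224x224 images, 16x16 patches, and model dimension 768; the random weights stand in for trained parameters):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    H, W, C = image.shape
    P = patch_size
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    return patches  # shape: (num_patches, P*P*C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
P, D = 16, 768

# 1. Split into fixed-size patches: (224/16)^2 = 196 patches of 16*16*3 = 768 values.
patches = patchify(image, P)

# 2. Linearly embed each patch into the model dimension D.
W_embed = rng.standard_normal((P * P * 3, D)) * 0.02  # trainable projection
patch_embeddings = patches @ W_embed                  # (196, D)

# 3. Prepend the learnable "classification token".
cls_token = np.zeros((1, D))
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)  # (197, D)

# 4. Add learnable position embeddings; the result goes to the encoder.
pos_embed = rng.standard_normal((tokens.shape[0], D)) * 0.02
encoder_input = tokens + pos_embed

print(encoder_input.shape)  # (197, 768)
```

The encoder itself is a standard transformer; at the output, the representation at the classification-token position is fed to a classification head.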
The authors evaluated the representation learning capabilities of three models: ResNet, Vision Transformer (ViT), and the hybrid. The models are pre-trained on datasets of varying size and evaluated on many benchmark tasks.
Datasets used: ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images.
The model trained on this dataset is then validated on benchmark tasks over CIFAR-10/100, Oxford-IIIT Pets, etc.
The above table depicts how ViT compares with the state of the art on popular image classification benchmarks. The first comparison point is Big Transfer (BiT), which performs supervised transfer learning with large ResNets. The second is Noisy Student, which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed.
All models were trained on TPUv3 hardware, and the number of TPUv3-days taken to pre-train can be seen at the bottom of the table.
When considering the computational cost of pre-training the model, the paper claims that ViT performs very favourably, attaining state-of-the-art performance on most recognition benchmarks.
The smaller ViT model matches or outperforms BiT-L on all datasets while requiring substantially fewer computational resources to train, and the larger model further improves performance on ImageNet and CIFAR-100.
- Vision Transformer matches or exceeds state-of-the-art on many image classification datasets, whilst being relatively cheap to pre-train.
- Initial experiments show improvement from self-supervised pre-training, but a large gap remains between self-supervised and large-scale supervised pre-training.
The authors conclude that detection and segmentation are among the challenges they would like to explore next. There have been previous works applying transformers to object detection: Facebook recently released DETR, or Detection Transformer, but that network still relies on CNNs, whereas ViT is poised to make convolutions obsolete.
Check the full paper here.