Transformers are all geared up to rule the world of computer vision. The runaway success of OpenAI's CLIP and DALL.E has a lot to do with the sudden interest in multimodal machine learning in research circles. OpenAI co-founder Ilya Sutskever has even forecast a future gravitating towards these fusion models that can handle both language and vision tasks. Apart from the recent OpenAI releases, other works have also attracted a great deal of attention in this subdomain.
Let’s take a look at a few of the recent popular works in this field:
Vision Transformer
Google's Vision Transformer (ViT) replicates the Transformer architecture of natural language processing as closely as possible. Similar to how text is handled, ViT treats an input image as a sequence of patches and predicts class labels for the image, allowing the model to learn image structure on its own. Every patch is flattened into a single vector by concatenating the channels of all pixels in the patch and then linearly projected to the desired input dimension. According to Google, ViT outperforms a state-of-the-art CNN while using roughly four times fewer computational resources, provided it is trained on sufficient data.
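For intuition, here is a minimal sketch in PyTorch (not the official implementation) of how an image can be cut into patches, each patch flattened and linearly projected, and a class token plus positional embeddings attached before the sequence enters a Transformer encoder:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly project each flattened patch."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        x = x.unfold(2, p, p).unfold(3, p, p)                # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)  # one flat vector per patch
        x = self.proj(x)                                     # linear projection to model dim
        cls = self.cls_token.expand(B, -1, -1)               # prepend learnable class token
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # add positional embeddings
        return x                                             # (B, num_patches + 1, dim)

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))       # -> torch.Size([2, 197, 768])
```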
Data-efficient Image Transformer
This year, Facebook announced the Data-efficient Image Transformer (DeiT), a vision Transformer that builds on Google's ViT research. Its key addition is a transformer-specific knowledge distillation procedure based on a distillation token, which reduces the amount of training data required. While distillation allows one neural network to learn from another, the distillation token is a learned vector that flows through the network along with the transformed image data and significantly improves image classification performance with less training data. As a result, DeiT can be trained on 1.2 million images rather than the hundreds of millions of images earlier vision Transformers required.
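A rough sketch of the idea, assuming a hypothetical student model that returns separate class-token and distillation-token logits plus a pre-trained CNN teacher: the distillation token is trained against the teacher's hard predictions while the class token is trained against the ground-truth labels.

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Hard-label distillation as in DeiT: the class token learns from the
    ground-truth labels, the distillation token learns from the teacher's
    hard predictions, and the two losses are averaged."""
    teacher_labels = teacher_logits.argmax(dim=-1)            # teacher's hard decisions
    loss_cls = F.cross_entropy(cls_logits, labels)            # supervised term
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # distillation term
    return 0.5 * loss_cls + 0.5 * loss_dist

# Toy usage with random tensors standing in for a student ViT and a CNN teacher.
B, num_classes = 8, 1000
loss = deit_hard_distillation_loss(
    cls_logits=torch.randn(B, num_classes),
    dist_logits=torch.randn(B, num_classes),
    teacher_logits=torch.randn(B, num_classes),
    labels=torch.randint(0, num_classes, (B,)),
)
```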
Image GPT
OpenAI developed Image GPT, an image Transformer that can generate coherent images when trained on pixel sequences. Trained like GPT-2 but on long sequences of pixels instead of text, the model learns 2-D image characteristics such as object appearance. It achieves strong results on several benchmarks without requiring an extensively labelled dataset during training. Even though the researchers reused the Transformer architecture of GPT-2, it delivered solid performance across diverse settings, demonstrating the capability of Transformers for image tasks.
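Conceptually, the training objective is just next-token prediction over a flattened, coarsely quantised pixel sequence. A simplified sketch, assuming a generic autoregressive transformer `model` that maps a token sequence to next-token logits:

```python
import torch
import torch.nn.functional as F

def image_to_sequence(images, levels=16):
    """Flatten a batch of images (B, C, H, W) in raster order and quantise each
    value to a small palette, yielding a 1-D token sequence per image.
    (Image GPT uses a learned colour palette; uniform quantisation is used
    here only to keep the sketch short.)"""
    tokens = (images.clamp(0, 1) * (levels - 1)).round().long()  # (B, C, H, W)
    return tokens.flatten(start_dim=1)                           # (B, C*H*W)

def autoregressive_loss(model, images):
    """Next-token prediction: every position is predicted from the tokens before it."""
    seq = image_to_sequence(images)
    inputs, targets = seq[:, :-1], seq[:, 1:]
    logits = model(inputs)                                       # (B, L-1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```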
CLIP: Contrastive Language-Image Pre-training
CLIP efficiently learns visual concepts from natural language supervision. It can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognised, similar to the "zero-shot" capabilities of GPT-2 and GPT-3. This technique eliminates the need for carefully labelled data to train classifiers, moving away from that cost-intensive approach. The idea is to make models more flexible by training on a wide variety of images paired with natural-language descriptions, instead of the fixed, task-specific label sets that traditional neural networks use.
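Zero-shot classification then amounts to embedding the image and a text prompt for each candidate class and picking the class whose text embedding is most similar. A sketch using OpenAI's open-source clip package; the image path and class names below are placeholders:

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical input image and candidate categories.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
prompts = clip.tokenize([f"a photo of a {c}" for c in ["dog", "cat", "car"]]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, prompts)   # similarity of the image to each prompt
    probs = logits_per_image.softmax(dim=-1)      # zero-shot class probabilities
print(probs)
```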
DEtection TRansformer – DETR
"DE⫶TR: End-to-End Object Detection with Transformers https://t.co/4W27PVx6Js (+paper https://t.co/xmxnSOBjBa) awesome to see a solid swing at (non-autoregressive) end-to-end detection. Anchor boxes + nms is a mess. (I was hoping detection would go end-to-end back in ~2013)"
— Andrej Karpathy (@karpathy), May 27, 2020
DEtection TRansformer (DETR) combines a set-based global loss that forces unique predictions via bipartite matching with a transformer encoder-decoder architecture. DETR uses a CNN backbone to compute a representation of the input image, adds positional encodings, and then passes the result to the transformer encoder. The approach streamlines the detection pipeline by removing hand-designed components such as non-maximum suppression and anchor generation, while performing on par with well-established state-of-the-art detectors.
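The set-based loss relies on a one-to-one (bipartite) matching between the fixed set of predictions and the ground-truth boxes, typically computed with the Hungarian algorithm. A simplified sketch using only a classification cost and an L1 box cost (the actual DETR matcher also includes a generalised IoU term):

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Bipartite matching for one image: returns (pred_idx, gt_idx) pairs that
    minimise a combined classification + box-distance cost, so each ground-truth
    box is assigned to exactly one prediction."""
    prob = pred_logits.softmax(-1)                       # (num_queries, num_classes)
    cost_class = -prob[:, gt_labels]                     # high prob on true class => low cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # pairwise L1 distance between boxes
    cost = cost_bbox + cost_class                        # (num_queries, num_gt)
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx

# Toy example: 100 object queries, 3 ground-truth objects.
pred_idx, gt_idx = match_predictions(
    pred_logits=torch.randn(100, 92),
    pred_boxes=torch.rand(100, 4),
    gt_labels=torch.tensor([3, 17, 42]),
    gt_boxes=torch.rand(3, 4),
)
```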
DALL.E
DALL.E is a 12-billion-parameter version of GPT-3 trained on text-image pairs to generate images from text descriptions. Given a caption, it can produce plausible images that combine unrelated concepts or apply transformations to existing images. Like Image GPT, it frames generation as autoregressive modelling: the text and image tokens are processed as a single stream by a Transformer that predicts them one after another.
TransGAN: Transformer-Based GAN Model
TransGAN is an image generation model that does not use any convolution operations. Both parts of the GAN, the generator and the discriminator, are built entirely from Transformers. The memory-friendly generator progressively increases feature resolution while decreasing the embedding dimension at each stage. With the new interest in using Transformers for vision tasks, the researchers believe that "two Transformers" of this kind can, in future, allow researchers and developers to simplify many computer vision problems.
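The "increase resolution, decrease embedding dimension" step can be illustrated with a pixel-shuffle style upsampling of the token grid between Transformer stages. This is a rough sketch of that single step, not the full TransGAN generator:

```python
import torch
import torch.nn as nn

def upsample_tokens(tokens, height, width):
    """Turn a (B, H*W, C) token sequence into a (B, 2H*2W, C//4) sequence:
    spatial resolution doubles while the embedding dimension drops by 4x,
    which keeps memory roughly constant across generator stages."""
    B, N, C = tokens.shape
    grid = tokens.transpose(1, 2).reshape(B, C, height, width)  # tokens -> feature map
    grid = nn.PixelShuffle(2)(grid)                             # (B, C//4, 2H, 2W)
    return grid.flatten(2).transpose(1, 2)                      # feature map -> tokens

# Toy usage: an 8x8 grid of 256-dim tokens becomes a 16x16 grid of 64-dim tokens.
out = upsample_tokens(torch.randn(2, 64, 256), height=8, width=8)
print(out.shape)   # torch.Size([2, 256, 64])
```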