Transformers For Vision: 7 Works That Indicate Fusion Is The Future Of AI

Transformers are all geared up to rule the world of computer vision. The runaway success of OpenAI’s CLIP and DALL.E has had a lot to do with the sudden interest in multi-modal machine learning in research circles. OpenAI co-founder Ilya Sutskever has even forecast a future gravitating towards these fusion models that can handle both language and vision tasks. Apart from the recent OpenAI releases, other works have also attracted a great deal of attention in this subdomain.

Let’s take a look at a few of the recent popular works in this field:

Vision Transformer

Google’s Vision Transformer (ViT) replicates the Transformer architecture of natural language processing as closely as possible. ViT represents an input image as a sequence and predicts class labels for it, letting the model learn the image’s spatial structure on its own rather than through convolutional inductive biases. Just as text is split into tokens, ViT treats the input image as a sequence of patches. Every patch is flattened into a single vector by concatenating the channels of all its pixels and then linearly projected to the desired input dimension. According to Google, ViT outperforms state-of-the-art CNNs while using roughly four times fewer computational resources, provided it is trained on sufficiently large datasets.
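A minimal sketch of that patch-embedding step in PyTorch. The patch size, embedding dimension and tensor shapes below are illustrative assumptions (they happen to match ViT-Base), not the full official implementation.

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)           # a batch of one RGB image
patch_size, embed_dim = 16, 768               # assumed: 16x16 patches, 768-d tokens

# Split the image into non-overlapping patches and flatten each one
# by concatenating the channels of all its pixels.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
# patches: (1, 196, 768) -- a 14x14 grid of flattened patches

# Linear projection to the Transformer input dimension, plus a learnable class token
# and position embeddings, as described in the paper.
proj = nn.Linear(3 * patch_size * patch_size, embed_dim)
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, 196 + 1, embed_dim))

tokens = torch.cat([cls_token, proj(patches)], dim=1) + pos_embed   # (1, 197, 768)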

Data-efficient Image Transformer

This year, Facebook announced the Data-efficient Image Transformer (DeiT), a vision Transformer that builds on Google’s ViT work. To cut training data requirements, the team devised a transformer-specific knowledge distillation procedure based on a distillation token. Distillation lets one neural network learn from another; the distillation token is a learned vector that flows through the network alongside the patch embeddings and significantly improves image classification performance with less training data. As a result, DeiT can be trained on about 1.2 million images, instead of the hundreds of millions of images the original ViT needed for pre-training.
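As a rough sketch of the idea: the distillation token is simply one extra learned vector appended to the token sequence, and its output is supervised by a teacher network while the class token is supervised by the ground-truth label. The shapes, the two-layer encoder and the unweighted loss sum below are assumptions for illustration, not DeiT’s exact recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, num_classes = 768, 1000              # assumed sizes
patch_tokens = torch.randn(1, 196, embed_dim)   # patch embeddings, as in the ViT sketch above

cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # the extra distillation token

# The distillation token travels through the encoder alongside the class and patch tokens.
tokens = torch.cat([cls_token, dist_token, patch_tokens], dim=1)   # (1, 198, 768)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True), num_layers=2)
out = encoder(tokens)

# Separate heads: the class token learns from the label, the distillation token from a teacher.
head_cls, head_dist = nn.Linear(embed_dim, num_classes), nn.Linear(embed_dim, num_classes)
logits_cls, logits_dist = head_cls(out[:, 0]), head_dist(out[:, 1])

label = torch.tensor([3])            # dummy ground-truth label
teacher_label = torch.tensor([3])    # label predicted by a CNN teacher (dummy here)
loss = F.cross_entropy(logits_cls, label) + F.cross_entropy(logits_dist, teacher_label)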

Image GPT

OpenAI developed Image GPT, an image Transformer that can generate coherent images when trained on pixel sequences. Although the model only ever sees long, one-dimensional sequences of pixels, it learns 2-D image characteristics such as object appearance and category. It performs competitively on several image classification benchmarks without requiring an extensively labelled dataset during training. Even though the researchers reused the same Transformer architecture as GPT-2, the model delivered strong performance across diverse settings, demonstrating the capability of Transformers for image tasks.
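The core idea is next-pixel prediction over a flattened image, analogous to next-token prediction in language modelling. The toy resolution and palette size below are assumptions to keep the example small; Image GPT itself works on downsampled images with a reduced colour palette.

import torch

# Assumed toy setup: 8x8 images with a 16-colour palette.
H, W, num_colours = 8, 8, 16
image = torch.randint(0, num_colours, (H, W))

# Flatten the 2-D image into a 1-D sequence in raster order,
# exactly as text is a 1-D sequence of tokens.
pixels = image.flatten()                  # shape (64,)

# Autoregressive objective: each position predicts the next pixel in the sequence.
inputs, targets = pixels[:-1], pixels[1:]
# A GPT-2-style decoder would consume `inputs` and be trained with
# cross-entropy against `targets`, one colour class per pixel.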

CLIP: Contrastive Language-Image Pre-training

CLIP efficiently learns visual concepts from natural language supervision. It can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognised, similar to the “zero-shot” capabilities of GPT-2 and GPT-3. This removes the need for expensive, task-specific labelled datasets: instead of training on a narrow set of curated images the way traditional classifiers do, CLIP is trained on a wide range of images paired with free-form text, which makes the resulting model far more flexible.
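A minimal zero-shot classification sketch, assuming OpenAI’s open-source clip package is installed (pip install git+https://github.com/openai/CLIP.git); the image file name and the prompt texts are placeholders.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "Label" the classifier simply by naming the categories in natural language.
class_names = ["a photo of a dog", "a photo of a cat", "a photo of a car"]  # placeholder prompts
text = clip.tokenize(class_names).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)       # placeholder image path

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))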

DEtection TRansformer – DETR

The DEtection TRansformer (DETR) has two main ingredients: a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. DETR uses a CNN backbone to compute a representation of the input image, adds positional encodings, and passes the result to the transformer encoder. The approach streamlines the detection pipeline by removing hand-designed components such as non-maximum suppression and anchor generation, while matching the accuracy of well-established, highly optimised detectors.
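A compressed sketch of that pipeline, in the spirit of the minimal implementation the DETR authors include in their paper: a torchvision ResNet backbone, a 1x1 convolution to reduce channels, a standard nn.Transformer, learned object queries, and class/box heads. The layer sizes, query count and naive learned positional embeddings here are assumptions for illustration, not the official code.

import torch
import torch.nn as nn
import torchvision

class MiniDETR(nn.Module):
    """Illustrative DETR-style detector (randomly initialised, not the official implementation)."""
    def __init__(self, num_classes=91, hidden_dim=256, num_queries=100):
        super().__init__()
        resnet = torchvision.models.resnet50()
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # CNN feature extractor
        self.conv = nn.Conv2d(2048, hidden_dim, 1)                     # project to transformer width
        self.transformer = nn.Transformer(hidden_dim, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))  # learned object queries
        self.pos_embed = nn.Parameter(torch.rand(50 * 50, hidden_dim))        # naive learned positions
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)              # +1 for "no object"
        self.bbox_head = nn.Linear(hidden_dim, 4)

    def forward(self, x):
        feats = self.conv(self.backbone(x))                    # (B, hidden, H', W')
        B, C, H, W = feats.shape
        src = feats.flatten(2).permute(2, 0, 1)                # (H'*W', B, hidden)
        src = src + self.pos_embed[: H * W].unsqueeze(1)       # add positional encoding
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)    # (num_queries, B, hidden)
        hs = self.transformer(src, tgt)                        # (num_queries, B, hidden)
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

logits, boxes = MiniDETR()(torch.randn(1, 3, 224, 224))        # 100 box predictions per image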

DALL.E

DALL.E is a 12-billion-parameter version of GPT-3 trained on text-image pairs to generate images from text descriptions. Captions and images are modelled as a single stream of tokens by one Transformer: the text is encoded as ordinary language tokens, the image as a grid of discrete codes produced by a separately trained discrete VAE, and the whole sequence is predicted autoregressively. This lets the same model combine unrelated concepts in plausible ways and apply transformations to existing images, all from a plain-language prompt.
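At a high level, the training stream is just text tokens followed by discrete image tokens, modelled by one decoder-only Transformer. The sketch below only illustrates that concatenation; the vocabulary sizes and sequence lengths are toy assumptions, and the image codes would in practice come from the discrete VAE encoder.

import torch

# Assumed toy sizes (DALL.E uses a much larger image codebook and longer sequences).
text_vocab, image_vocab = 1000, 512
text_len, image_len = 16, 64                  # e.g. a short caption and an 8x8 grid of image codes

text_tokens = torch.randint(0, text_vocab, (1, text_len))
image_tokens = torch.randint(0, image_vocab, (1, image_len))   # would come from a discrete VAE

# Offset the image codes so the two vocabularies do not collide, then concatenate into one stream.
stream = torch.cat([text_tokens, image_tokens + text_vocab], dim=1)   # (1, 80)

# A decoder-only Transformer is trained to predict each token from the ones before it;
# at generation time the text tokens are given and the image tokens are sampled one by one.
inputs, targets = stream[:, :-1], stream[:, 1:]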

TransGAN: Transformers Based GAN Model

TransGAN is a GAN-based image generation model that does away with convolution operations entirely. Both parts of the GAN, the generator and the discriminator, are built purely from Transformers: the memory-friendly generator progressively increases feature resolution while decreasing the embedding dimension, and the discriminator classifies images from patch-level inputs. With the renewed interest in using Transformers for vision tasks, the researchers believe that a pair of Transformers like this could allow researchers and developers to simplify many computer vision pipelines in the near future.
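A skeletal illustration of one convolution-free generator stage: tokens on a low-resolution grid pass through a Transformer encoder block and are then upsampled to a larger grid with proportionally smaller embeddings. The sizes and the pixel-shuffle-style upsampling choice are assumptions for illustration, not the exact TransGAN design.

import torch
import torch.nn as nn

embed_dim, grid = 256, 8                     # assumed: 8x8 token grid, 256-d embeddings
noise = torch.randn(1, 128)                  # latent vector

to_tokens = nn.Linear(128, grid * grid * embed_dim)
block = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)

tokens = to_tokens(noise).reshape(1, grid * grid, embed_dim)    # (1, 64, 256)
tokens = block(tokens)                                          # self-attention only, no convolutions

# Upsample the token grid 2x while cutting the embedding width 4x,
# so the stage trades channel depth for spatial resolution.
tokens = tokens.reshape(1, grid, grid, embed_dim)
tokens = nn.PixelShuffle(2)(tokens.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)   # (1, 16, 16, 64)
tokens = tokens.reshape(1, 16 * 16, embed_dim // 4)             # the next stage works on a 16x16 grid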
