Transformers For Vision: 7 Works That Indicate Fusion Is The Future Of AI

Transformers are all geared up to rule the world of computer vision. The runaway success of OpenAI’s CLIP and DALL·E had a lot to do with the sudden interest in multimodal machine learning in research circles. OpenAI co-founder Ilya Sutskever has even forecast a future gravitating towards these fusion models that can handle both language and vision tasks. Apart from the recent OpenAI releases, other works have also attracted a great deal of attention in this subdomain.

Let’s take a look at a few of the recent popular works in this field:

Vision Transformer

Google’s Vision Transformer (ViT) replicates the Transformer architecture of natural language processing as closely as possible. ViT represents image inputs as sequences and predicts class labels for the image, allowing the model to learn image structure independently. Just as natural language processing treats text as a sequence of tokens, ViT treats an input image as a sequence of patches. Every patch is flattened into a single vector by concatenating the channels of all pixels in the patch and then linearly projecting it to the desired input dimension. According to Google, ViT outperforms state-of-the-art CNNs with four times fewer computational resources when trained on sufficient data.
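The patch-to-sequence step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not ViT’s actual implementation: the projection matrix here is random, whereas in ViT it is a learned linear layer, and the real model also adds a class token and positional embeddings.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (h/p, w/p, p, p, c)
    return patches.reshape(-1, p * p * c)        # (num_patches, patch_dim)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))       # a standard ViT input size
patches = patchify(image, 16)                    # 14 x 14 = 196 patches of dim 768
proj = rng.standard_normal((16 * 16 * 3, 768)) * 0.02  # learned in the real model
tokens = patches @ proj                          # (196, 768) sequence fed to the encoder
```

With 16x16 patches on a 224x224 image, the "sentence" seen by the Transformer is 196 tokens long, which is why the original paper is titled "An Image is Worth 16x16 Words".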


Data-efficient Image Transformer

This year, Facebook announced the Data-efficient Image Transformer (DeiT), a vision Transformer that improved on Google’s ViT research. To reduce training data requirements, the researchers built a transformer-specific knowledge distillation procedure based on a distillation token. While distillation allows one neural network to learn from another, the distillation token is a learned vector that flows through the network along with the transformed image data, significantly enhancing image classification performance with less training data. DeiT can be trained with 1.2 million images, instead of the hundreds of millions of images used to pre-train ViT.
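The distillation token can be pictured as one extra row in the input sequence, sitting alongside the usual class token and the patch tokens. The sketch below shows only this sequence construction, with zero placeholders standing in for vectors that are learned in the real model; the dimensions (192, 196 patches) match DeiT-Tiny but are otherwise illustrative.

```python
import numpy as np

embed_dim, num_patches = 192, 196
rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((num_patches, embed_dim))
cls_token = np.zeros((1, embed_dim))    # learned in the real model
dist_token = np.zeros((1, embed_dim))   # learned distillation token: DeiT's addition

# The transformer encoder processes class token, distillation token and
# patch tokens together as one sequence.
sequence = np.concatenate([cls_token, dist_token, patch_tokens], axis=0)

# After the encoder, one head reads the class token (trained against the true
# labels) and a second head reads the distillation token (trained to match the
# predictions of a CNN teacher).
```

Because the distillation token attends to the patch tokens through every layer, the teacher’s signal shapes the whole network rather than just the output layer.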

Image GPT

OpenAI developed Image GPT — an image Transformer that can generate coherent images when trained on pixel sequences. Trained with the GPT-2 architecture on long sequences of pixels, the model learns 2-D image characteristics, such as object appearance and category, without labels. The model performs well on several benchmarks without requiring an extensively labelled dataset during training. Though the researchers used the same Transformer architecture as GPT-2, it delivered strong performance across diverse settings, demonstrating the capability of Transformers for image tasks.
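The core idea is that an image, once flattened in raster order, looks to the model exactly like a text sequence. A minimal sketch of this preprocessing step (the real Image GPT also downsamples and quantises pixel colours into a small palette before flattening, which is omitted here):

```python
import numpy as np

# Image GPT-style preprocessing sketch: flatten a small image into a 1-D
# sequence of "pixel tokens" in raster order, then model it autoregressively.
image = np.arange(32 * 32).reshape(32, 32)   # stand-in for a quantised 32x32 image
sequence = image.reshape(-1)                 # 1024 pixel tokens

# Next-pixel prediction objective: at each position t, predict sequence[t]
# from sequence[:t], exactly as GPT-2 predicts the next word from its prefix.
inputs, targets = sequence[:-1], sequence[1:]
```

Nothing in this setup tells the model that pixel 33 sits directly below pixel 1; the 2-D structure it exploits is learned entirely from the data.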




CLIP: Contrastive Language-Image Pre-training

CLIP efficiently learns visual concepts from natural language supervision. It can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognised, similar to the “zero-shot” capabilities of GPT-2 and GPT-3. This technique reduces the need for carefully labelled data to train models, moving away from that cost-intensive approach. The idea is to make models more flexible by training on a wide range of image–text pairs from the web, instead of the task-specific labelled images that traditional neural networks use.
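Zero-shot classification with CLIP boils down to embedding the candidate class names as text prompts and picking the one closest to the image embedding. The sketch below shows that logic only; the mock encoder and its two-dimensional vectors are stand-ins for CLIP’s trained image and text towers, and the `"a photo of a ..."` prompt template follows the pattern used in OpenAI’s paper.

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_classify(image_emb, class_names, text_encoder):
    """Pick the class whose prompt embedding is closest to the image embedding."""
    prompts = [f"a photo of a {name}" for name in class_names]
    scores = [cosine_sim(image_emb, text_encoder(p)) for p in prompts]
    return class_names[int(np.argmax(scores))]

# Mock text encoder standing in for CLIP's trained text tower.
mock_text_encoder = {
    "a photo of a cat": np.array([1.0, 0.0]),
    "a photo of a dog": np.array([0.0, 1.0]),
}.get
image_emb = np.array([0.9, 0.1])   # mock image embedding, close to "cat"
label = zero_shot_classify(image_emb, ["cat", "dog"], mock_text_encoder)
```

Swapping in a different list of class names requires no retraining at all, which is exactly what makes CLIP applicable to arbitrary classification benchmarks.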

DEtection TRansformer – DETR

DEtection TRansformer combines a set-based global loss, which forces unique predictions via bipartite matching, with a transformer encoder-decoder architecture. DETR uses a CNN backbone to extract a representation of the input image, adds positional encodings, and then passes the result to a transformer encoder. The approach streamlines the detection pipeline by removing hand-designed components such as non-maximum suppression and anchor generation, while ensuring the model generalises as well as state-of-the-art detectors.
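The bipartite matching at the heart of DETR’s loss assigns each prediction to at most one ground-truth object so that the total matching cost is minimal, which is what forces predictions to be unique. A toy sketch of that assignment step (DETR itself uses the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`; the brute-force version below is only for illustration on tiny inputs):

```python
import itertools
import numpy as np

def bipartite_match(cost):
    """Find the one-to-one assignment of predictions to targets with
    minimal total cost (DETR's set-based matching, brute-forced)."""
    n = cost.shape[0]
    best, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n)):
        total = sum(cost[i, perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best_perm

# cost[i, j]: how badly prediction i matches ground-truth object j
# (in DETR this mixes classification score and box-overlap terms)
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.7, 0.3]])
assignment = bipartite_match(cost)
```

Only the matched prediction is penalised for each object, so the model has no incentive to emit duplicate boxes, which is why non-maximum suppression becomes unnecessary.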

DALL·E

DALL·E is a 12-billion-parameter version of GPT-3 trained on a dataset of text–image pairs to generate images from text descriptions. The model receives the text and the image as a single stream of tokens and is trained autoregressively, which allows it to create plausible images for a wide variety of prompts and even combine unrelated concepts in a single image. Alongside CLIP, it is one of the clearest demonstrations yet of a single Transformer handling language and vision together.

TransGAN: Transformers Based GAN Model

TransGAN is a GAN that does not use any convolution operations: both components of the GAN — the generator and the discriminator — are built entirely from Transformers. Its memory-friendly generator progressively increases feature resolution while decreasing the embedding dimension. With the new interest in using Transformers for vision tasks, the researchers behind the work — titled “Two Pure Transformers Can Make One Strong GAN” — believe such convolution-free architectures could soon simplify many computer vision problems.


Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
