A Complete Learning Path To Transformers (With Guide To 23 Architectures)

The attention mechanism in Transformers began a revolution in deep learning that led to extensive research across many domains

Transformers, introduced in 2017 by Ashish Vaswani et al., began a revolution in deep learning. Although the original Transformer was designed only for natural language processing, its attention mechanism has since driven extensive research in many other domains. This article discusses 23 Transformer architectures, spanning NLP, image processing, and video generation.

Transformers overcame two difficulties faced by recurrent neural networks (RNNs): the trouble of retaining context over long sequences when tokens are processed serially, one time step at a time, and the failure to make effective use of parallel hardware such as GPUs and TPUs. Moreover, Transformers can model sequential and contextual information, which convolutional neural networks handle poorly. A couple of libraries have been released specifically for Transformers and their variants, including HuggingFace and Self-AttentionCV.

Read more here.

Natural Language Processing


1| BERT

While earlier language models read a sentence either from left to right or from right to left, BERT (Bidirectional Encoder Representations from Transformers) reads the whole sentence in both directions at once, which lets it capture the context of each word well. Because building a language model end to end is costly, owing to the large number of parameters and the long training time, Google developed BERT to serve as a pre-trained base that can be fine-tuned for downstream tasks. BERT remains one of the preferred pre-trained models, even though many variants and extensions have been developed since.

Read more about BERT here.

2| Transformer XL

Transformer XL was developed to address the fixed-length context of earlier Transformer language models, a limitation that BERT shares. Such models process text in segments of a fixed maximum length, while real-world text comes in very different lengths, so any context that falls outside the current segment is lost. Transformer XL can use context that extends beyond a single segment. To do this, it stores the hidden states computed for the previous segment and reuses them as additional context when processing the current segment.
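
The idea of storing earlier hidden states and reusing them can be sketched in a few lines. The following is a toy, single-layer illustration in plain Python, not the actual Transformer XL implementation: the function names, the `mem_len` parameter, and the use of raw vectors as queries, keys, and values are all assumptions made for the example.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(segment: List[Vector], memory: List[Vector]) -> List[Vector]:
    """Each position attends to the cached memory plus positions up to itself."""
    outputs = []
    for i, query in enumerate(segment):
        context = memory + segment[: i + 1]
        scores = [sum(q * k for q, k in zip(query, key)) for key in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(len(query))]
        )
    return outputs


def run_with_memory(segments: List[List[Vector]], mem_len: int) -> List[List[Vector]]:
    """Process segments in order, carrying the last `mem_len` hidden states forward."""
    memory: List[Vector] = []
    all_outputs = []
    for segment in segments:
        hidden = causal_attention(segment, memory)
        all_outputs.append(hidden)
        # Cache the newest hidden states for the next segment (hypothetical policy:
        # keep only the most recent `mem_len` vectors; keep nothing if mem_len is 0).
        memory = (memory + hidden)[-mem_len:] if mem_len > 0 else []
    return all_outputs


segments = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
without_memory = run_with_memory(segments, mem_len=0)
with_memory = run_with_memory(segments, mem_len=2)
```

With `mem_len=0` the second segment sees only itself; with `mem_len=2` its output also mixes in the cached states from the first segment.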

Read more here

3| XL-Net

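XL-Net is an extension of Transformer XL with a few more improvements. It resolves an important problem that the baseline BERT faced: BERT predicts each masked word from a probability distribution, but the predictions for different masked words are made almost independently of one another, which sometimes hurts the coherence of the predicted text. XL-Net addresses this with permutation language modeling, an autoregressive objective that predicts tokens one at a time over different orderings of the sentence, so each prediction can depend on earlier predictions while still using context from both directions.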

Read more here

4| RoBERTa

RoBERTa, introduced by Facebook AI Research, is a language Transformer with roughly 350 million parameters, built on BERT’s architecture. To generalize better across tasks, it was trained on more data than BERT and for much longer, more than 25,000 GPU hours. Several task-specific variants of RoBERTa have been introduced with minor modifications.

Read more about RoBERTa and its variants here.


5| ALBERT

The goal of releasing BERT, to use it as a pre-trained model and fine-tune it per task, became impractical in many situations because of the number of parameters BERT has. In this scenario, ALBERT (A Lite BERT) was introduced with roughly 90% fewer parameters while performing close to the baseline BERT. ALBERT shares parameters across its Transformer layers, including the multi-head attention layers. Because the same parameters are reused in every layer, stacking more layers adds depth without adding parameters, which keeps the total parameter count small.
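
To see why sharing parameters across layers shrinks a model, here is a rough back-of-the-envelope count. It is only a sketch: it assumes a standard Transformer encoder layer (four attention projections, a two-layer feed-forward block, two layer norms), uses illustrative sizes of 768 hidden units, 3,072 feed-forward units and 12 layers, and ignores the embeddings entirely. The function names are made up for the example.

```python
def encoder_layer_params(hidden: int, ffn: int) -> int:
    """Parameter count of one standard Transformer encoder layer (weights and biases)."""
    attention = 4 * (hidden * hidden + hidden)  # query, key, value and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # scale and shift for two layer norms
    return attention + feed_forward + layer_norms


def encoder_params(num_layers: int, hidden: int, ffn: int, share_across_layers: bool) -> int:
    """Total encoder parameters, with or without cross-layer sharing."""
    per_layer = encoder_layer_params(hidden, ffn)
    # With sharing, every layer reuses the same weights, so depth costs nothing extra.
    return per_layer if share_across_layers else num_layers * per_layer


unshared = encoder_params(12, 768, 3072, share_across_layers=False)
shared = encoder_params(12, 768, 3072, share_across_layers=True)
print(unshared, shared)  # 85054464 7087872
```

Under these assumptions the shared encoder stack is 12 times smaller than the unshared one, regardless of how many layers are stacked on top.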

Read more here.

6| DistilBERT

DistilBERT is a variant of BERT developed with the same objective as ALBERT: a smaller model. DistilBERT has 40% fewer parameters than the baseline BERT while retaining 97% of its performance, and it is reported to be 60% faster than the baseline at inference. It also took relatively little time to train. The model is trained by knowledge distillation, with the baseline BERT acting as the teacher, using dynamic masking and without the Next Sentence Prediction objective. DistilBERT uses half as many layers as the baseline BERT to reduce the number of parameters. In addition, it removes the token-type embeddings and the pooling layer. Its pre-trained version is available for deployment through the HuggingFace library.

Read more here


7| ELECTRA

ELECTRA is yet another efficient model, one that needs less compute, training time, and data than the models released before it. BERT and the variants discussed above are trained with masked language modeling: about 15% of the input tokens are masked, and the model learns to recover them from the remaining 85%. Because the model learns only from the masked 15% of each example, BERT and its variants require a large amount of data. ELECTRA instead uses a small BERT-like generator to fill the masked positions with plausible tokens, and trains the main model to decide, for every token in the input, whether it is the original token or one replaced by the generator. ELECTRA thus behaves as a discriminator that learns from all tokens, and it achieves strong performance with far less training.
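
The way training examples for such a discriminator can be built is sketched below. This is a toy illustration only: a random choice from a small vocabulary stands in for the generator network, and the function name, the `mask_rate` default, and the example sentence are assumptions, not part of ELECTRA's code.

```python
import random
from typing import List, Sequence, Tuple


def corrupt_and_label(
    tokens: Sequence[str],
    vocab: Sequence[str],
    mask_rate: float = 0.15,
    rng: random.Random = None,
) -> Tuple[List[str], List[int]]:
    """Replace a fraction of tokens and label every position as original (0) or replaced (1)."""
    rng = rng or random.Random(0)
    corrupted = list(tokens)
    n_replace = max(1, round(mask_rate * len(tokens)))
    for pos in rng.sample(range(len(tokens)), n_replace):
        # Stand-in for sampling from the generator's predicted distribution.
        corrupted[pos] = rng.choice(vocab)
    # A position counts as "replaced" only if the new token differs from the original.
    labels = [int(new != old) for new, old in zip(corrupted, tokens)]
    return corrupted, labels


tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
corrupted, labels = corrupt_and_label(tokens, vocab, mask_rate=0.4, rng=random.Random(1))
```

Note that every position receives a label, so the discriminator gets a training signal from the whole sentence rather than from the masked positions alone.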

Read more here.

8| DeBERTa

DeBERTa, another variant of BERT developed by Microsoft, introduced a disentangled attention mechanism and an enhanced mask decoder that significantly improved model performance. Disentangled attention represents each input token with two vectors, one for its content (the word) and another for its relative position, and computes attention weights from both. Further, the enhanced mask decoder incorporates the absolute positions of tokens when predicting masked tokens. At the time of its release, DeBERTa outperformed the other BERT variants on several benchmarks.
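
A simplified sketch of how an attention score can be assembled from separate word and position vectors follows. It deliberately omits the learned projection matrices and scaling that a real model would use, and the clipping scheme in `relative_index` is an assumption chosen for brevity, so treat it as an illustration of the idea rather than DeBERTa's exact formulation.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def relative_index(i: int, j: int, max_distance: int) -> int:
    """Clip the offset i - j to [-max_distance, max_distance], then shift it to start at 0."""
    return max(-max_distance, min(max_distance, i - j)) + max_distance


def disentangled_score(
    content: List[Vector], rel_pos: List[Vector], i: int, j: int, max_distance: int
) -> float:
    """Unnormalized attention score of token i attending to token j."""
    content_to_content = dot(content[i], content[j])
    content_to_position = dot(content[i], rel_pos[relative_index(i, j, max_distance)])
    position_to_content = dot(rel_pos[relative_index(j, i, max_distance)], content[j])
    return content_to_content + content_to_position + position_to_content


# Two tokens with 2-dimensional word vectors, and one vector per clipped offset (-1, 0, +1).
content = [[1.0, 0.0], [0.0, 1.0]]
rel_pos = [[1.0, 1.0], [0.0, 0.0], [0.0, 3.0]]
score_01 = disentangled_score(content, rel_pos, 0, 1, max_distance=1)
score_00 = disentangled_score(content, rel_pos, 0, 0, max_distance=1)
```

Here the two word vectors are orthogonal, so the whole score between token 0 and token 1 comes from the two position terms.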

Read more here.

9| Reformer

Google’s Reformer made Transformers more efficient with three techniques: locality-sensitive hashing (LSH) attention, chunked feed-forward layers, and reversible residual layers. Together, these techniques reduce memory consumption and improve efficiency on long inputs. Unlike most other Transformers, Reformer can process and generate very long sequences while retaining context. Because of this, Reformer can be applied to different domains such as image, video, and music generation and time-series analysis.
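
Of the three techniques, reversible residual layers are the easiest to show in code. The sketch below uses plain Python lists and arbitrary stand-in functions `f` and `g` (in Reformer these roles are played by the attention and feed-forward sublayers); the function names are chosen for the example.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def reversible_forward(x1: Vector, x2: Vector, f: Sublayer, g: Sublayer) -> Tuple[Vector, Vector]:
    """y1 = x1 + f(x2); y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def reversible_inverse(y1: Vector, y2: Vector, f: Sublayer, g: Sublayer) -> Tuple[Vector, Vector]:
    """Recover the inputs from the outputs, so activations need not be stored."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


def double(v: Vector) -> Vector:  # stand-in for the attention sublayer
    return [2 * t for t in v]


def add_one(v: Vector) -> Vector:  # stand-in for the feed-forward sublayer
    return [t + 1 for t in v]


y1, y2 = reversible_forward([1, 2], [3, 4], double, add_one)
x1, x2 = reversible_inverse(y1, y2, double, add_one)
```

Because the inputs of each layer can be recomputed from its outputs during the backward pass, memory use no longer has to grow with the number of layers.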

Read more here.  

10| Evolved Transformer

Google’s attempt to find a well-performing Transformer for neural machine translation through Neural Architecture Search (NAS) resulted in the Evolved Transformer. At the time of its publication, this automatically discovered architecture outperformed human-designed Transformer variants on the well-known WMT translation benchmarks. Compared with the original Transformer, the Evolved Transformer makes a few architectural modifications, such as adding convolutional layers alongside the attention blocks, while its functionality remains almost unchanged.

Read more here

11| GPT-3

Which Transformer is the biggest one? At the time of writing, the obvious answer is GPT-3. GPT stands for Generative Pre-trained Transformer. The first version, GPT, was released before BERT with about 117 million parameters. OpenAI released the latest version, GPT-3, in 2020. It has 175 billion parameters and was trained on an enormous amount of data. Because it was trained on such a variety of data, GPT-3 can generate text, and even code, in many domains with great contextual accuracy. While a large community celebrates GPT-3, its size makes it impractical to deploy on most devices.

Read more here.


12| ENCONTER

Text-generating models behave like a poet or a writer, generating text as they wish, so control over the generated text becomes crucial for making text generators reliable. ENCONTER generates text under constraints, with entities as the constraints. For example, if the model is provided with the job title and the skills of the job seeker, it can generate a job description.

Read more here

13| CTRL

Salesforce’s CTRL, the Conditional Transformer Language model, supports control over text generation through attributes called control codes. These codes can be adjusted finely to steer the generated text toward what the user expects. Text can be conditioned on date, time, content, names, and even URLs. While other language models condition the next word only on the previous words, CTRL conditions each prediction on the control code as well.
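
Conditioning on a control code in addition to the previous words can be illustrated with a tiny count-based model. This is a deliberately simple stand-in, not how CTRL itself is built: the bigram counts, the function names, and the two example codes and sentences are assumptions made for the sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Counts = Dict[Tuple[str, str], Counter]


def train(corpus: Iterable[Tuple[str, List[str]]]) -> Counts:
    """Count next-token frequencies keyed by (control code, previous token)."""
    counts: Counts = defaultdict(Counter)
    for code, tokens in corpus:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[(code, prev)][nxt] += 1
    return counts


def next_token(counts: Counts, code: str, prev: str) -> Optional[str]:
    """Most frequent continuation of `prev` under the given control code, if any."""
    options = counts.get((code, prev))
    return options.most_common(1)[0][0] if options else None


corpus = [
    ("Reviews", ["the", "movie", "was", "great"]),
    ("Legal", ["the", "court", "ruled"]),
]
counts = train(corpus)
print(next_token(counts, "Reviews", "the"))  # movie
print(next_token(counts, "Legal", "the"))    # court
```

The same previous word leads to different continuations depending on the control code, which is the behavior the paragraph above describes.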

Read more here

Image Processing

14| Vision Transformer

While Transformers were thought to be confined to natural language processing, Vision Transformer broke that rule. It showed that attention mechanisms can process images effectively too. It breaks an image into many patches and feeds the model the sequence of these patches. Vision Transformer learned the key image features well and surpassed most convolutional neural network models of its time. It opened the door to addressing tasks from many domains with Transformers.
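
Breaking an image into a sequence of patches can be sketched as follows. The example uses a small single-channel image stored as nested lists; the function name and the row-by-row patch order are assumptions for illustration, and a real model would go on to project each flattened patch into an embedding.

```python
from typing import List

Image = List[List[int]]


def image_to_patches(image: Image, patch_size: int) -> List[List[int]]:
    """Split an image into non-overlapping square patches, each flattened to a vector."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append(
                [
                    image[row][col]
                    for row in range(top, top + patch_size)
                    for col in range(left, left + patch_size)
                ]
            )
    return patches


# A 4x4 image with pixel values 0..15, split into four 2x2 patches.
image = [[4 * row + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)
print(patches[0])   # [0, 1, 4, 5]
print(patches[-1])  # [10, 11, 14, 15]
```

The resulting list of patch vectors plays the same role for the model as a sequence of word tokens does in NLP.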

Read more here.  

15| DeiT

Although Vision Transformer outperformed many carefully designed CNNs on the ImageNet dataset, it needed a lot of external data in addition to ImageNet, and with more data came a need for more compute power. In this scenario, DeiT, the Data-Efficient Image Transformer, was introduced. It was trained only on ImageNet with relatively little compute power while reaching a new milestone in top-1 accuracy. While the Vision Transformer employed a multi-layer perceptron as its decision head, DeiT employed a linear classifier.

Read more here.

16| DALL·E

Following the tremendous success of GPT-3, OpenAI released DALL·E, a text-to-image Transformer that is a variant of GPT-3. DALL·E was trained on pairs of texts and corresponding images, so that given a text prompt it can generate an image, or a set of images, matching the prompt. The generated image can be controlled by adding more attributes to the prompt.

Read more here.

17| Image-Recipe Hierarchical Transformer

Amazon’s Image-Recipe Hierarchical Transformer is another cross-domain Transformer: it can output the recipe when provided with a photograph of food, or vice versa. It employs a hierarchical Transformer architecture with two parallel encoders, one for the image and another for the recipe text. It can be trained either on image-recipe pairs in a supervised mode or on recipes alone in a self-supervised mode.

Read more here.

18| DETR

DETR, the Detection Transformer, combines a CNN with a Transformer to detect objects in an image with bounding boxes. DETR employs a CNN as a feature extractor and a Transformer encoder-decoder to process the extracted features. It performs well in object detection, classification, and localization, close to well-acclaimed CNN-only models. Further, the Transformer in DETR can be pre-trained on unannotated image data in an unsupervised manner. This approach is called UP-DETR.

Read more about DETR here and UP-DETR here.

19| Point Transformer

3D data is mostly collected in point-cloud form. A point cloud is a sparse, unordered set of data points. Since the attention mechanism is essentially a set operator, point-cloud data can be processed effectively and efficiently by Transformer architectures. Much recent research has gone into developing Transformers for 3D point-cloud data, and these models have achieved state-of-the-art results on the tasks they have been applied to.
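
The claim that attention acts on a set can be checked directly: if the input points are reordered, the outputs are reordered in the same way and otherwise unchanged. The sketch below uses plain dot-product self-attention without positional encodings or learned weights, which is a simplification and not the attention variant used by any particular point-cloud model.

```python
import math
from typing import List

Point = List[float]


def self_attention(points: List[Point]) -> List[Point]:
    """Dot-product self-attention where each point serves as query, key and value."""
    outputs = []
    for query in points:
        scores = [sum(q * k for q, k in zip(query, key)) for key in points]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append(
            [sum(e / total * p[d] for e, p in zip(exps, points)) for d in range(len(query))]
        )
    return outputs


points = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5], [0.3, 0.3, 0.3]]
order = [2, 0, 1]  # an arbitrary reordering of the same three points

original = self_attention(points)
shuffled = self_attention([points[i] for i in order])

# Each shuffled output matches the original output of the same point.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for new_idx, old_idx in enumerate(order)
    for a, b in zip(shuffled[new_idx], original[old_idx])
)
```

Since no point order is baked into the computation, the same model can be applied to point clouds however their points happen to be listed.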

Read more here

Medical Image Segmentation 

20| Medical Transformer

Though image segmentation is widely applied in fields such as autonomous driving, robot navigation, and virtual reality, medical image segmentation is among its most critical applications. The Transformer architecture, with some modifications, resulted in the Medical Transformer, designed specifically for medical image segmentation with improved accuracy.

Read more here.

21| TransUNet

TransUNet blends convolutional neural networks and Transformers. Following the success of the skip connections of the UNet architecture and the attention mechanisms of the Transformer architecture, TransUNet was developed to combine both in one architecture. It captures local features with convolutional layers and global context with Transformer layers. TransUNet outperformed the CNN-only UNet versions in medical image segmentation.

Read more here.

Video Processing

22| Video GPT

Video GPT, the video-generating Transformer, has a carefully designed architecture that combines a Vector Quantized Variational AutoEncoder (VQ-VAE) with self-attention blocks for learning from raw videos and generating high-quality video outputs. This architecture follows in the footsteps of the original GPT architecture. It generates realistic video samples that compete with those of GANs.

Read more here.  

23| TimeSformer

Facebook AI Research developed a Transformer-only architecture called TimeSformer, exclusively for video processing. TimeSformer outperformed most CNN-only models and CNN-Transformer blends on video action-recognition benchmarks such as Kinetics-400 and Kinetics-600. It captures both local and global relationships among image patches across space and time. Moreover, TimeSformer is more compute-efficient than competitors that rely on computationally heavy 3D convolutional neural networks.

Read more here.   
