Comprehensive Guide To Transformers

Transformers have been a game changer in many fields of artificial intelligence, including natural language processing (NLP), computer vision, and video and audio processing, as well as in other disciplines such as chemistry and life sciences. BERT, RoBERTa, DeBERTa, GPT-3, Transformer-XL, DALL·E, Hugging Face Pegasus and Self-AttentionCV are some of the popular transformer-based models and libraries.

Lately, transformer variants (aka X-formers) have piqued the interest of researchers and developers, with much of the work aimed at improving model efficiency, generalisation and adaptation.

In a recent paper, ‘A Survey of Transformers,’ researchers from the School of Computer Science and the Shanghai Key Laboratory of Intelligent Information Processing at Fudan University, China, gave a comprehensive overview of various transformers.

Vanilla transformer 

The vanilla transformer is a sequence-to-sequence (seq2seq) model that consists of an ‘encoder’ and a ‘decoder,’ each of which is a stack of L identical blocks. Each encoder block comprises a multi-head self-attention module and a position-wise feed-forward network (FFN).

Multi-head self-attention runs an attention mechanism several times in parallel, with each ‘head’ working on its own learned projection of the input. The independent attention outputs are then concatenated and linearly transformed into the expected dimension. The position-wise FFN, on the other hand, is a small fully connected network, typically two linear layers with a non-linear activation in between, that is applied separately and identically to every position in the sequence.
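The two modules described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the function names, the weight matrices (w_q, w_k, w_v, w_o, w1, w2) and the choice of ReLU as the activation are assumptions made for the example, and matrices are stored as lists of rows to keep the code dependency-free.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T, shape T x T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Run `num_heads` attentions in parallel, concatenate, then project."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0]) // num_heads
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)  # this head's feature slice
        heads.append(attention([r[cols] for r in q],
                               [r[cols] for r in k],
                               [r[cols] for r in v]))
    # Concatenate the per-head outputs position by position.
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    return matmul(concat, w_o)


def position_wise_ffn(x, w1, b1, w2, b2):
    """Apply the same two-layer network (ReLU in between) to every position."""
    out = []
    for row in x:
        hidden = [max(0.0, h + b) for h, b in zip(matmul([row], w1)[0], b1)]
        out.append([o + b for o, b in zip(matmul([hidden], w2)[0], b2)])
    return out
```

Here x holds one row per sequence position. The attention weights for each position sum to one, so every head returns a weighted average of the value vectors, while the FFN never mixes information across positions.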

To build a deep model, a residual connection is employed around each module, followed by a layer normalisation module. Compared to the encoder blocks, decoder blocks insert cross-attention modules between the ‘multi-head self-attention modules’ and the ‘position-wise FFNs.’ Furthermore, the self-attention modules in the ‘decoder’ are adapted to prevent each position from attending to subsequent positions.
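As a rough sketch of the two ideas in this paragraph, the snippet below wraps a sub-layer in a residual connection followed by layer normalisation, and builds the mask that stops a decoder position from attending to later positions. The helper names are hypothetical, and the learnable gain and bias that layer normalisation normally carries are left out for brevity.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalise one position's features to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def add_and_norm(x, sublayer):
    """Residual connection followed by layer norm: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return [layer_norm([a + b for a, b in zip(xr, yr)]) for xr, yr in zip(x, y)]


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores, mask):
    """Softmax over the allowed entries only; masked entries get weight 0."""
    kept = [s for s, m in zip(scores, mask) if m]
    peak = max(kept)
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

In a decoder, row i of the attention scores would be passed through masked_softmax together with causal_mask(T)[i], so the first position can only attend to itself and the last position can attend to the whole prefix.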

The overall architecture of the vanilla transformer is shown below.

An overview of vanilla transformer architecture (Source: arXiv)

How are Transformers different?

As a central piece of the transformer, self-attention comes with a flexible mechanism to deal with variable-length inputs. It can be viewed as a fully connected layer whose weights are dynamically generated from pairwise relations between the inputs.

The table below compares the complexity, sequential operations, and maximum path length of self-attention with three commonly used layer types, where T is the sequence length, D is the representation dimension, and K is the kernel size of convolutions.

The table shows per-layer complexity, the minimum number of sequential operations and maximum path lengths for different layer types. (Source: arXiv)
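Since the table itself is an image, the sketch below plugs numbers into the asymptotic orders usually quoted for these four layer types, using T, D and K as defined above. The formulas are an assumption to be checked against the paper's table, constant factors are dropped, and the function names are made up for this example.

```python
def ceil_log(T, K):
    """Smallest n with K**n >= T, using integers to avoid float rounding."""
    n, reach = 0, 1
    while reach < T:
        reach *= K
        n += 1
    return n


def layer_costs(T, D, K):
    """Return {layer type: (per-layer ops, sequential ops, max path length)}.

    Values are asymptotic orders with constants dropped, so they are only
    meaningful for comparing how the layer types scale with T, D and K.
    """
    return {
        "self-attention": (T * T * D, 1, 1),
        "fully connected": (T * T * D * D, 1, 1),
        "convolutional": (K * T * D * D, 1, ceil_log(T, K)),
        "recurrent": (T * D * D, T, T),
    }


if __name__ == "__main__":
    for name, (ops, sequential, path) in layer_costs(T=512, D=64, K=3).items():
        print(f"{name:16s} ops~{ops:>14,d}  sequential~{sequential:>4d}  path~{path:>4d}")
```

Under these orders, self-attention is cheaper per layer than a recurrent layer whenever T is smaller than D, and more expensive once the sequence grows longer than the representation dimension.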

The advantages of self-attention include: 

  • It has the same maximum path length as fully connected layers, making it suitable for modeling long-range dependencies. Compared to fully connected layers, it is more parameter-efficient and more flexible in handling variable-length inputs.
  • Because of the limited receptive field of convolutional layers, one typically needs to stack a deep network to have a global receptive field. The constant maximum path length, on the other hand, enables self-attention to model long-range dependencies with a constant number of layers.
  • The constant number of sequential operations and the constant maximum path length make self-attention more parallelisable and better at long-range modeling than recurrent layers.
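To make the receptive-field point concrete, the following sketch counts how many stacked 1-D convolutional layers are needed before one output position can see all T inputs. It assumes stride 1 and, for the dilated case, a dilation that grows by a factor of K per layer; both functions are illustrative helpers, not something taken from the paper.

```python
def plain_conv_layers(T, K):
    """Layers of stride-1, undilated convolutions needed to cover T inputs."""
    if K < 2:
        raise ValueError("kernel size must be at least 2")
    layers, field = 0, 1
    while field < T:
        field += K - 1  # each layer widens the receptive field by K - 1
        layers += 1
    return layers


def dilated_conv_layers(T, K):
    """Same, but with the dilation multiplied by K at every layer."""
    if K < 2:
        raise ValueError("kernel size must be at least 2")
    layers, field, dilation = 0, 1, 1
    while field < T:
        field += (K - 1) * dilation  # receptive field becomes K ** layers
        dilation *= K
        layers += 1
    return layers
```

For a sequence of 1,024 positions and a kernel size of 3, the undilated stack needs hundreds of layers and the dilated stack needs a logarithmic number, whereas a single self-attention layer already connects every pair of positions.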

Recently, Google introduced MLP-Mixer, an architecture for computer vision based exclusively on multi-layer perceptrons (MLPs). The computational complexity of this model is linear in the number of input patches, unlike that of transformers, which is quadratic. The MLP model uses skip connections and regularisation.
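The linear-versus-quadratic claim can be illustrated with a rough operation count. The sketch assumes that the hidden width of the token-mixing MLP is fixed independently of the number of patches, which is what makes the cost linear; the function and parameter names are hypothetical and constant factors are approximate.

```python
def token_mixing_ops(num_patches, channels, hidden_tokens):
    """Multiply-adds for one token-mixing MLP (two dense layers per channel)."""
    return 2 * num_patches * hidden_tokens * channels


def self_attention_ops(num_patches, channels):
    """Multiply-adds for the Q K^T scores plus the weighted sum of values."""
    return 2 * num_patches * num_patches * channels
```

Doubling the number of patches doubles the first count but quadruples the second, which is the difference the paragraph above refers to.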

In terms of inductive biases, convolutional networks impose the inductive biases of translation invariance and locality through shared local kernel functions. Recurrent networks, on the other hand, carry the inductive biases of temporal invariance and locality via their Markovian structure. The transformer architecture, by contrast, makes few assumptions about the structural information of data, which makes it a universal and flexible architecture. The downside is that this lack of structural bias makes it prone to overfitting on small-scale data.

Another closely related network type is graph neural networks (GNNs) with message passing. The transformer can be seen as a GNN defined over a complete directed graph where each input is a node in the graph. Compared to GNNs, the transformer introduces no prior knowledge about how the input data is structured: the message passing process in the transformer depends solely on similarity measures over the content.
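The GNN view can be sketched as one round of message passing in which edge weights come only from content similarity. In this simplified example (no learned query, key or value projections, and hypothetical function names), running the same update over a complete directed graph with self-loops reduces to plain dot-product self-attention.

```python
import math


def attention_message_passing(features, neighbours):
    """One round of message passing with content-based edge weights.

    features:   list of node feature vectors
    neighbours: neighbours[i] lists the nodes that send messages to node i
                (each node needs at least one sender)
    """
    scale = math.sqrt(len(features[0]))
    updated = []
    for i, senders in enumerate(neighbours):
        # Edge score = scaled dot product between receiver and sender features.
        scores = [sum(a * b for a, b in zip(features[i], features[j])) / scale
                  for j in senders]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        message = [0.0] * len(features[0])
        for weight, j in zip(exps, senders):
            for d, value in enumerate(features[j]):
                message[d] += (weight / total) * value
        updated.append(message)
    return updated


def complete_graph(num_nodes):
    """Every node receives from every node, itself included."""
    return [list(range(num_nodes)) for _ in range(num_nodes)]
```

Passing a sparser neighbours list instead of complete_graph(n) is where a GNN would inject prior knowledge about structure; the transformer simply lets every node talk to every other node.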

A new taxonomy 

At present, a wide variety of models have been proposed that build on the vanilla transformer. The researchers categorise them from three perspectives:

  • Types of architecture modification
  • Pre-training methods (PTMs)
  • Applications

The images below illustrate the categorisation of transformer variants and the taxonomy of transformers.

Categorisation of transformers (Source: arXiv)

Taxonomy of transformers (Source: arXiv)

Theoretical analysis

The transformer architecture has been shown to be capable of exploiting large-scale training datasets, provided it has enough parameters.

  • The transformer has a larger capacity than CNNs and RNNs and can handle a tremendous amount of training data.
  • When a transformer is trained on sufficient data, it usually delivers better performance than CNNs or RNNs.
  • Because it makes few prior assumptions about the structure of data, the transformer is more flexible than CNNs and RNNs.

Beyond attention

One of the key benefits of the transformer is its use of the attention mechanism to model the global dependencies among nodes within input data. However, many studies have shown that full attention is unnecessary for most nodes. It is, to an extent, inefficient to calculate attention for all nodes indiscriminately.
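One common way to avoid scoring every pair of nodes is to restrict each position to a local window. The sketch below is an illustration of that general idea rather than a method singled out in this article: it builds such a band mask and counts how many query-key pairs remain, with the window parameter and function names being assumptions made for the example.

```python
def local_attention_mask(length, window):
    """mask[i][j] is True when |i - j| <= window (a band around the diagonal)."""
    return [[abs(i - j) <= window for j in range(length)] for i in range(length)]


def count_scored_pairs(mask):
    """Number of query-key pairs whose attention score is actually computed."""
    return sum(sum(row) for row in mask)
```

With a fixed window, the number of scored pairs grows linearly with the sequence length instead of quadratically, at the price of losing direct connections between distant positions.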

There is plenty of room for improvement in modeling global interactions efficiently. The self-attention module can be regarded as a fully connected neural network with dynamic connection weights, which aggregates non-local information through dynamic routing. Other dynamic routing mechanisms are therefore alternatives worth exploring. Global interaction can also be modeled by other types of neural networks, such as memory-enhanced models.

A unified framework for multimodal data 

Integrating multimodal data is useful and often essential for boosting task performance in many application scenarios. General-purpose AI also needs the ability to capture the semantic relations across different modalities.

Since transformers have achieved great success on text, images, video, and audio, there is a chance to build a unified framework that better captures the inherent connections among multimodal data. However, the researchers said the design of intra-modal and cross-modal attention still needs to be improved.

“We wish this study to be a hands-on reference for better understanding the current research progress on transformers and help readers to improve transformers for various applications,” wrote the researchers. 


Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
