When Transformers Fail

Transformers are the de facto architecture of choice for natural language processing tasks. Since their introduction in 2017, Transformers have undergone several modifications.

Recently, a team of researchers from Google Research found that most modifications do not meaningfully improve Transformer performance. Popular modifications to Transformers include changes to activation functions (such as GeLU and Sigmoid), normalisation, depth, embeddings, and parameter sharing. The researchers noted that most Transformer variants found to be beneficial were either developed in the same codebase in which they originated or are relatively minor changes.
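As an illustration of how small such drop-in changes can be, here is a minimal sketch (in plain Python, for a single scalar input) of the GeLU activation using its common tanh approximation, alongside the ReLU it typically replaces in the Transformer's feedforward layers:

```python
import math

def relu(x):
    # ReLU, the activation used in the original Transformer feedforward block
    return max(0.0, x)

def gelu(x):
    # Tanh approximation of GeLU, a common drop-in replacement for ReLU
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

Unlike ReLU, GeLU is smooth and non-zero for small negative inputs; swapping the activation changes no other part of the architecture, which is what makes it an easy modification to try.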

Why This Research

According to the researchers, there are two possible explanations for why a slightly modified version of the originally proposed Transformer remains in use:

  • The originally proposed Transformer architecture was near-perfect, leaving developers little room to improve it.
  • The modifications proposed to the Transformer architecture do not generalise across applications. That is, they only help in the limited experimental settings in which they were proposed, because their benefits depend on implementation details that are not shared across Transformer codebases.

The researchers tried to determine why most modifications proposed to the Transformer have not seen widespread adoption. To understand the modifications, they reimplemented and evaluated a wide variety of Transformer variants on a suite of tasks.

The Transformer variants evaluated in this research are:

  • Transparent Attention: This variant of the Transformer creates weighted residual connections along the encoder depth to facilitate gradient flow.
  • Evolved Transformer: The Evolved Transformer is another variant designed via an evolution-based architecture search. 
  • Synthesizer variants: The researchers explore the factorised, dense, and random Synthesizer variants, where self-attention is replaced with “synthetic attention” patterns.
  • Funnel Transformer: Funnel Transformer reduces the sequence length so that it can efficiently encode the input sequence.
  • Dynamic and Lightweight Convolutions: Dynamic convolution uses kernels that are functions of the input at the current time step. Lightweight convolution, on the other hand, is a type of depthwise convolution that ties weights across every block of m consecutive channels, where m is a hyperparameter, and normalises the weights across the filter dimension.
  • Sparse Expert Transformers: Sparse Expert Transformers, such as the Mixture of Experts Transformer and the Switch Transformer, replace the feedforward network with sparsely activated expert layers.
  • Product Key Memory: This variant processes inputs adaptively by replacing the feedforward network with a large memory layer whose values are sparsely selected via product keys.
  • Universal Transformer: This variant applies the same Transformer “block” repeatedly to the input sequence. However, instead of applying it a fixed number of times, the Transformer recurrently refines each token’s representation until a halting mechanism is triggered.
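To make the contrast with standard self-attention concrete, here is a minimal NumPy sketch (with illustrative, made-up shapes) of the random Synthesizer variant described above, in which the input-dependent attention matrix computed from queries and keys is replaced by a single learned parameter matrix:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(X, Wq, Wk, Wv):
    # Vanilla self-attention: weights depend on the input via Q and K
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def random_synthesizer(X, R, Wv):
    # Random Synthesizer: the attention matrix R is a learned parameter,
    # fixed across inputs -- no query/key projections at all
    V = X @ Wv
    return softmax(R) @ V

rng = np.random.default_rng(0)
n, d = 4, 8                       # sequence length and model width (illustrative)
X = rng.standard_normal((n, d))   # token representations
R = rng.standard_normal((n, n))   # learned in practice; randomly initialised here
Wv = rng.standard_normal((d, d))
out = random_synthesizer(X, R, Wv)
print(out.shape)  # (4, 8)
```

The sketch highlights the design choice: the Synthesizer keeps the softmax-weighted mixing of value vectors but drops the pairwise query-key comparison, so the mixing pattern no longer adapts to the input.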

Wrapping Up

The researchers found that Transformer modifications exhibit a surprising lack of generalisation across different implementations and tasks. On a concluding note, they suggested methodologies to improve the robustness of future architectural modifications to Transformers.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
