When Transformers Fail

Transformers are the de facto architecture of choice for natural language processing tasks. Since their introduction in 2017, many modifications to the architecture have been proposed.

Recently, a team of researchers from Google Research found that most modifications do not meaningfully improve the Transformer’s performance. Popular modifications include alternative activation functions (such as GeLU and Sigmoid), normalisation, depth, embeddings, and parameter sharing. Most of the Transformer variants that were found beneficial were either developed in the same codebase the researchers used or are relatively minor changes, the researchers stated.

Why This Research

According to the researchers, there are two possible explanations for why practitioners still use only a slightly modified version of the originally proposed Transformer:

  • The originally proposed Transformer architecture was near-perfect, leaving developers little room to improve it.
  • The modifications proposed to the Transformer architecture do not generalise across applications. That is, they help only in the limited experimental setting in which they were proposed, or they depend on specific details that are not common across implementations of the Transformer.

The researchers tried to determine why most modifications proposed to the Transformer have not seen widespread adoption. To understand the modifications, they reimplemented and evaluated a wide variety of Transformer variants on a suite of tasks.

The modified Transformer variants used in this research are:

  • Transparent Attention: This variant of the Transformer creates weighted residual connections along the encoder depth to facilitate gradient flow.
  • Evolved Transformer: The Evolved Transformer is another variant designed via an evolution-based architecture search. 
  • Synthesizer variants: The researchers explore the factorised, dense, and random Synthesizer variants, where self-attention is replaced with “synthetic attention” patterns.
  • Funnel Transformer: Funnel Transformer reduces the sequence length so that it can efficiently encode the input sequence.
  • Dynamic and Lightweight Convolutions: Dynamic convolution uses kernels that are functions of the input at the current time step. Lightweight convolution, on the other hand, is a type of depthwise convolution that shares weights across each consecutive group of m channels, where m is a hyperparameter, and normalises the weights across the filter dimension.
  • Sparse Expert Transformers: Sparse Expert Transformers, such as the Mixture of Experts Transformer and the Switch Transformer, replace the feedforward network with sparsely activated expert layers.
  • Product Key Memory: Product Key Memory networks process inputs adaptively, selecting sparse values.
  • Universal Transformer: This variant applies the same Transformer “block” repeatedly to the input sequence. However, instead of applying it a fixed number of times, the Transformer recurrently refines each token’s representation until a halting mechanism is triggered.
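
To make the lightweight convolution described in the list above more concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the researchers' implementation: the function name lightweight_conv, the causal left zero-padding, and the choice to group consecutive channels are all assumptions made for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def lightweight_conv(x, raw_kernels, m):
    """Illustrative lightweight convolution (hypothetical helper).

    x           : list of T time steps, each a list of C channel values
    raw_kernels : list of C // m kernels, each a list of K unnormalised weights
    m           : number of consecutive channels sharing one kernel

    Assumption: causal convolution with zero-padding on the left, so the
    output at step t only sees steps t-K+1 .. t.
    """
    num_steps = len(x)
    num_channels = len(x[0])
    if num_channels % m != 0:
        raise ValueError("number of channels must be divisible by m")
    if len(raw_kernels) != num_channels // m:
        raise ValueError("expected one kernel per group of m channels")

    # Normalise each kernel across the filter (kernel-width) dimension.
    kernels = [softmax(k) for k in raw_kernels]

    out = [[0.0] * num_channels for _ in range(num_steps)]
    for t in range(num_steps):
        for c in range(num_channels):
            kernel = kernels[c // m]  # channels in the same group share weights
            width = len(kernel)
            acc = 0.0
            for j in range(width):
                src = t - (width - 1) + j  # input step covered by weight j
                if src >= 0:  # positions before the sequence count as zero
                    acc += kernel[j] * x[src][c]
            out[t][c] = acc  # depthwise: channel c only reads channel c
    return out


if __name__ == "__main__":
    sequence = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [5.0, 6.0, 7.0, 8.0]]
    # Two groups of m = 2 channels, each with a kernel of width 2.
    print(lightweight_conv(sequence, [[0.0, 0.0], [0.0, 0.0]], m=2))
```

With 4 channels and m = 2, only two kernels are stored instead of four, which is where the parameter saving over an ordinary depthwise convolution comes from. A dynamic convolution would differ in that the kernel would be computed from the input at each time step rather than being fixed.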

Wrapping Up

The researchers found that Transformer modifications exhibit a surprising lack of generalisation across different implementations and tasks. In conclusion, the researchers suggested some methodologies intended to ensure the robustness of future architectural modifications to Transformers.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
