When Transformers Fail

Transformers are the de facto architecture for natural language processing tasks. Since their introduction in 2017, Transformers have undergone several modifications.

Recently, a team of researchers from Google Research found that most of these modifications do not meaningfully improve the Transformer’s performance. Popular modifications include alternative activation functions (such as GeLU and Sigmoid), as well as changes to normalisation, depth, embeddings, and parameter sharing. The researchers noted that most of the Transformer variants found to be beneficial were either developed in the same codebase as the original or amount to relatively minor changes.
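To make the activation-function modifications concrete, here is a minimal NumPy sketch of the functions mentioned above, alongside the ReLU used in the original Transformer’s feed-forward layers. The GeLU here uses the common tanh approximation; this is an illustrative sketch, not the exact implementation used in the study.

```python
import numpy as np

def relu(x):
    # Activation used in the original Transformer's feed-forward sublayer.
    return np.maximum(0.0, x)

def gelu(x):
    # GeLU via the widely used tanh approximation:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def sigmoid(x):
    # Sigmoid, another drop-in candidate for the feed-forward activation.
    return 1.0 / (1.0 + np.exp(-x))
```

All three are smooth (or piecewise-linear) element-wise functions, which is why swapping them into the feed-forward sublayer is such a cheap, popular modification to try.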

Why This Research

According to the researchers, there are two possible explanations for why a slightly modified version of the originally proposed Transformer remains the standard:

  • The originally proposed Transformer architecture was near-perfect, leaving developers little room to improve it.
  • The modifications proposed to the Transformer architecture do not generalise across applications: they help only in the limited experimental settings where they were proposed, because their benefits depend on implementation-specific details rather than on what is common across Transformer implementations.

The researchers tried to determine why most modifications proposed to the Transformer have not seen widespread adoption. To do so, they reimplemented and evaluated a wide variety of Transformer variants on a common suite of tasks.

The modified Transformer variants evaluated in this research are:

  • Transparent Attention: This variant of the Transformer creates weighted residual connections along the encoder depth to facilitate gradient flow.
  • Evolved Transformer: The Evolved Transformer is another variant designed via an evolution-based architecture search. 
  • Synthesizer variants: The researchers explored the factorised, dense, and random Synthesizer variants, where self-attention is replaced with “synthetic attention” patterns.
  • Funnel Transformer: Funnel Transformer reduces the sequence length so that it can efficiently encode the input sequence.
  • Dynamic and Lightweight Convolutions: Dynamic convolution uses kernels that are functions of the input at the current time step. Lightweight convolution, on the other hand, is a type of depthwise convolution that shares weights across every block of m consecutive channels, where m is a hyperparameter, and normalises the weights across the filter dimension.
  • Sparse Expert Transformers: Sparse Expert Transformers, such as Mixture of Experts Transformer, Switch Transformer, among others, replace the feedforward network with sparsely activated experts layers. 
  • Product Key Memory: Product key memory networks process inputs adaptively, selecting a sparse subset of values from a large learned memory.
  • Universal Transformer: This variant applies the same Transformer “block” repeatedly to the input sequence. Instead of applying it a fixed number of times, it recurrently refines each token’s representation until a halting mechanism is triggered.

Wrapping Up

The researchers found that Transformer modifications exhibit a surprising lack of generalisation across different implementations and tasks. To conclude, they suggested methodologies to help ensure the robustness of future architectural modifications to Transformers.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
