
When Transformers Fail


Transformers are the de facto architecture of choice for natural language processing tasks. Since their introduction in 2017, Transformers have undergone numerous modifications.

Recently, a team of researchers from Google Research found that most of these modifications do not meaningfully improve Transformers’ performance. Popular modifications to the Transformer include alternative activation functions (such as GeLU and Sigmoid), as well as changes to normalisation, depth, embeddings, and parameter sharing. Most of the Transformer variants found to be beneficial were either developed in the same codebase the researchers used or were relatively minor changes, the researchers stated.
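
To make concrete what such a modification looks like in practice, here is a minimal sketch (not the researchers’ code; the dimensions and names are illustrative) of a Transformer feed-forward block in which the activation function is the only thing that changes between variants:

```python
# Minimal sketch of a Transformer feed-forward block with a configurable
# activation -- one of the simplest modifications studied (ReLU vs. GeLU, etc.).
# Dimensions and names here are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, activation="relu"):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        # Swapping this single module is the whole "modification".
        self.act = {"relu": nn.ReLU(), "gelu": nn.GELU(), "sigmoid": nn.Sigmoid()}[activation]

    def forward(self, x):
        return self.w_out(self.act(self.w_in(x)))

x = torch.randn(2, 16, 512)                      # (batch, sequence, d_model)
print(FeedForward(activation="gelu")(x).shape)   # torch.Size([2, 16, 512])
```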

Why This Research

According to the researchers, there are two possible explanations for why a slightly modified version of the originally proposed Transformer remains the dominant choice:

  • The originally proposed Transformer architecture was near-perfect, leaving developers little room to improve it.
  • The modifications proposed to the Transformer architecture do not generalise across applications; that is, they only help in the limited experimental settings in which they were proposed, because they rely on specific details that are not shared across implementations of the Transformer.

The researchers set out to determine why most modifications proposed to the Transformer have not seen widespread adoption. To do so, they reimplemented and evaluated a wide variety of Transformer variants on a suite of tasks.

The modified Transformer variants used in this research are:

  • Transparent Attention: This variant of the Transformer creates weighted residual connections along the encoder depth to facilitate gradient flow.
  • Evolved Transformer: The Evolved Transformer is another variant designed via an evolution-based architecture search. 
  • Synthesiser variants: The researchers explored the dense, factorised, and random Synthesiser variants, in which self-attention is replaced with “synthetic attention” patterns (a minimal sketch of the dense variant follows this list).
  • Funnel Transformer: Funnel Transformer reduces the sequence length so that it can efficiently encode the input sequence.
  • Dynamic and Lightweight Convolutions: Dynamic convolution uses kernels that are functions of the input at the current time step. Lightweight convolution, on the other hand, is a type of depthwise convolution that shares weights across every block of m consecutive channels, where m is a hyperparameter, and normalises the weights across the filter dimension.
  • Sparse Expert Transformers: Sparse Expert Transformers, such as the Mixture-of-Experts Transformer and the Switch Transformer, replace the feedforward network with sparsely activated expert layers.
  • Product Key Memory: Product key memory layers let the network process inputs adaptively by sparsely selecting values from a large learned memory.
  • Universal Transformer: This variant applies the same Transformer “block” repeatedly to the input sequence. However, instead of applying it a fixed number of times, the Transformer recurrently refines each token’s representation until a halting mechanism is triggered.
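
As a concrete illustration of one of these variants, below is a minimal sketch of the dense Synthesiser’s “synthetic attention”: attention weights are generated from each token on its own via a feed-forward projection, rather than from query-key dot products. This is a toy, single-head implementation with assumed dimensions, not the code evaluated in the paper.

```python
# Toy, single-head dense Synthesiser attention: attention logits are synthesised
# directly from the input instead of being computed as Q.K^T.
import torch
import torch.nn as nn

class DenseSynthesizerAttention(nn.Module):
    def __init__(self, d_model=512, max_len=64):
        super().__init__()
        # Maps each token to a row of attention logits over up to max_len positions.
        self.synth = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, max_len)
        )
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        logits = self.synth(x)[:, :, :seq_len]   # (batch, seq_len, seq_len)
        weights = torch.softmax(logits, dim=-1)  # "synthetic" attention weights
        return weights @ self.value(x)           # mix the values with those weights

x = torch.randn(2, 16, 512)
print(DenseSynthesizerAttention()(x).shape)      # torch.Size([2, 16, 512])
```

The factorised and random Synthesiser variants mentioned above differ only in how these logits are produced: via a low-rank factorisation, or as a directly learned matrix, respectively.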

Wrapping Up

The researchers found that Transformer modifications exhibit a surprising lack of generalisation across different implementations and tasks. On a concluding note, they suggested methodologies to help ensure that future architectural modifications to the Transformer are robust.


Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.