Transformers are the de facto architecture for natural language processing tasks. Since their introduction in 2017, Transformers have undergone several modifications.
Recently, a team of researchers from Google Research found that most modifications do not meaningfully improve Transformers' performance. Popular modifications include alternative activation functions (such as GeLU and Sigmoid), normalisation, depth, embeddings, and parameter sharing. Most of the Transformer variants found beneficial were either developed in the same codebase the researchers used or are relatively minor changes, the researchers stated.
Why This Research
According to the researchers, there are two possible explanations for why only a slightly modified version of the originally proposed Transformer remains in common use:
- The originally proposed Transformer architecture was near-perfect, leaving developers little room to improve it.
- The modifications proposed to the Transformer architecture do not generalise across applications. That is, the modifications only help in the limited experimental setting in which they were proposed, or they rely on specific details that are not common across implementations of the Transformer.
The researchers tried to determine why most modifications proposed to the Transformer have not seen widespread adoption. To understand the modifications, they reimplemented and evaluated a wide variety of Transformer variants on a suite of tasks.
The modified Transformer variants used in this research are:
- Transparent Attention: This variant of the Transformer creates weighted residual connections along the encoder depth to facilitate gradient flow.
- Evolved Transformer: The Evolved Transformer is another variant designed via an evolution-based architecture search.
- Synthesizer variants: The researchers explore the factorised, dense, and random Synthesizer variants, where self-attention is replaced with “synthetic attention” patterns.
- Funnel Transformer: Funnel Transformer reduces the sequence length so that it can efficiently encode the input sequence.
- Dynamic and Lightweight Convolutions: Dynamic convolution uses kernels that are functions of the input at the current time step. Lightweight convolution, on the other hand, is a type of depthwise convolution that shares weights across each group of m consecutive channels, where m is a hyperparameter, and normalises the weights across the filter dimension.
- Sparse Expert Transformers: Sparse Expert Transformers, such as the Mixture of Experts Transformer and the Switch Transformer, replace the feedforward network with sparsely activated expert layers.
- Product Key Memory: Product key memory networks process inputs adaptively, selecting sparse values.
- Universal Transformer: This variant applies the same Transformer “block” repeatedly to the input sequence. However, instead of applying it a fixed number of times, the Transformer recurrently refines each token’s representation until a halting mechanism is triggered.
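To make one of these descriptions concrete, here is a minimal pure-Python sketch of the lightweight convolution idea from the list above: a depthwise convolution in which every m consecutive channels share one kernel, and each kernel is softmax-normalised across the filter dimension. It is an illustration under stated assumptions, not the researchers' implementation: the function names, the list-of-lists data layout, and the causal zero-padding are all choices made for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def lightweight_conv(x, raw_kernels, m):
    """Depthwise 1D convolution with kernels shared across groups of m channels.

    x           : list of time steps, each a list of per-channel floats
    raw_kernels : one unnormalised kernel (list of floats) per channel group
    m           : number of consecutive channels that share a kernel

    Assumption for this sketch: the window is causal (it ends at the
    current step) and positions before the sequence start count as zero.
    """
    channels = len(x[0])
    if channels % m != 0 or len(raw_kernels) != channels // m:
        raise ValueError("need exactly one kernel per group of m channels")

    # Normalise each kernel across the filter (kernel-width) dimension.
    kernels = [softmax(kernel) for kernel in raw_kernels]
    width = len(kernels[0])

    output = []
    for t in range(len(x)):
        row = []
        for c in range(channels):
            kernel = kernels[c // m]  # channels in the same group reuse one kernel
            acc = 0.0
            for j in range(width):
                src = t - (width - 1) + j
                if src >= 0:  # skip the implicit zero padding
                    acc += kernel[j] * x[src][c]
            row.append(acc)
        output.append(row)
    return output


# Four channels in two groups of m = 2, so only two kernels are needed.
sequence = [[1.0, 2.0, 3.0, 4.0] for _ in range(3)]
result = lightweight_conv(sequence, raw_kernels=[[0.0, 0.0], [0.0, 0.0]], m=2)
print(result)
```

Because the kernels are shared per group, the number of kernel parameters here is (channels / m) × kernel width rather than channels × kernel width, which is the point of the weight sharing described above.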
Wrapping Up
The researchers found that Transformer modifications exhibit a surprising lack of generalisation across different implementations and tasks. In closing, they suggested practices intended to make future architectural modifications to Transformers more robust.