The key component of the Transformer architecture is the attention module. Its job is to find the matching pairs in a sequence (think of word alignment in translation) by computing similarity scores. As the length of a sequence increases, calculating similarity scores for all pairs becomes inefficient, since the cost grows quadratically. Researchers have therefore come up with sparse attention techniques, which compute scores for only a subset of pairs and cut down time and memory requirements.
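The quadratic cost is easy to see in code. The sketch below (a minimal NumPy illustration, not the Performer implementation) materialises the full n × n similarity matrix that regular softmax attention requires:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: an n x n similarity matrix is materialised,
    so time and memory grow quadratically with sequence length n."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax normalisation
    return weights @ V

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))                 # toy queries, keys, values
out = softmax_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Sparse attention avoids computing some entries of that (n, n) matrix; the Performer, discussed below, avoids materialising it altogether.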
According to Google researchers, sparse attention methods still suffer from a number of limitations:
- They require efficient sparse-matrix multiplication operations, which are not available on all accelerators.
- They do not provide rigorous theoretical guarantees for their representation power.
- They are optimised primarily for Transformer models and generative pre-training.
- They are difficult to use with other pre-trained models, as they usually stack more attention layers to compensate for sparse representations, thus requiring retraining and significant energy consumption.
- They are not sufficient to address the full range of problems to which regular attention methods are applied, such as Pointer Networks.
Along with these, there are also some operations that cannot be sparsified, such as the commonly used softmax operation, which normalises similarity scores in the attention mechanism and is used heavily in industry-scale recommender systems.
To overcome the limitations of sparse transformers, Google introduced Performers, a Transformer architecture with attention mechanisms that scale linearly, enabling faster training while allowing the model to handle longer sequence lengths.
Overview Of Performers
The Performer uses an efficient (linear) generalised attention framework, which allows a broad class of attention mechanisms based on different similarity measures (kernels). The framework is implemented by the novel Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm, which provides scalable, low-variance and unbiased estimation of attention mechanisms that can be expressed by regular softmax attention. Regular softmax attention is a special case of this framework, with the nonlinear functions defined by exponential functions and Gaussian projections.
The key to approximating any attention matrix efficiently is the use of positive random features, i.e., positive-valued nonlinear functions of the original queries and keys.
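The idea can be sketched in a few lines. The feature map below follows the paper's positive random features, phi(x) = exp(Wx - ||x||²/2)/√m with Gaussian rows W, so that phi(q)·phi(k) estimates exp(q·k); the orthogonalisation of the random projections and the softmax temperature scaling are omitted here for brevity, so this is an illustrative sketch rather than the full FAVOR+ algorithm. Reordering the matrix products as phi(Q)(phi(K)ᵀV) is what makes the computation linear in sequence length:

```python
import numpy as np

def positive_random_features(X, W):
    """phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m): positive-valued features
    whose dot products give unbiased estimates of exp(q . k)."""
    m = W.shape[0]
    sq_norms = np.sum(X ** 2, axis=-1, keepdims=True) / 2.0
    return np.exp(X @ W.T - sq_norms) / np.sqrt(m)

def linear_attention(Q, K, V, W):
    """Compute phi(K)^T V first, so the n x n attention matrix
    is never formed: time and memory are linear in n."""
    Qp = positive_random_features(Q, W)
    Kp = positive_random_features(K, W)
    numerator = Qp @ (Kp.T @ V)                       # (n, d) via an (m, d) summary
    denominator = Qp @ Kp.sum(axis=0, keepdims=True).T  # per-query normaliser
    return numerator / denominator

rng = np.random.default_rng(1)
n, d, m = 16, 4, 256
Q, K, V = 0.5 * rng.normal(size=(3, n, d))
W = rng.normal(size=(m, d))                           # Gaussian projections
approx = linear_attention(Q, K, V, W)
```

Because the features are positive, the estimated attention weights stay positive and the normaliser stays well-behaved, which is what distinguishes FAVOR+ from earlier sine/cosine random-feature estimators.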
To evaluate Performers, the researchers ran experiments on protein sequences. Proteins are large molecules with complex 3D structures, and like words, proteins are specified as linear sequences where each character is one of 20 amino acid building blocks.
Applying Transformers to large unlabelled corpora of protein sequences yields models that can be used to make accurate predictions about protein folds. The Performer (with ReLU-based attention), stated the researchers, performed strongly at modelling protein sequence data, while the Performer-Softmax matched the performance of the Transformer.
According to the researchers, Performers matched the performance of Transformers and showed promise for applications beyond Transformers. This work, concluded the researchers, is an attempt to diversify research in the area of non-sparse attention techniques, and Performers offer a brand new way of thinking about attention, Transformer architectures, and even kernel methods.
Implications Of Performers
According to the researchers, this framework can have significance in the following areas:
Performers have the potential to directly impact research on biological sequence analysis by enabling the Transformer to be applied to much longer sequences without constraints on the structure of the attention matrix. Modern bioinformatics can benefit immensely from faster, more accurate language models, for example in the development of new nanoparticle vaccines.
Performers, built on the FAVOR+ algorithm, lead to much lower compute costs and substantially lower space complexity, which can be directly translated into CO2 emission reduction and lower energy consumption, since regular Transformers require very large computational resources.
According to the researchers, FAVOR+ can also be applied to tasks outside the scope of Transformers. This opens up Performers to a wide range of avenues, including hierarchical attention networks, graph attention networks, image processing, and reinforcement learning.
Check the original paper here.