NLP models process language by generating fixed- or variable-length vector-space representations. These representations of individual words aggregate information from adjacent words to determine the meaning of a phrase in the context in which it is used. Contextual learning in language models got a major boost from the Transformer architecture. The 2017 seminal work “Attention Is All You Need” by A. Vaswani et al. set off a cascade effect that enabled machines to understand language like never before. Spearheaded by Transformer models, the NLP revolution has grown into a multi-billion-dollar industry (e.g., BERT powering Google Search). The Transformer is based entirely on attention mechanisms, dispensing with recurrence and convolutions. A Transformer network applies a self-attention mechanism that scans every word and assigns it attention scores (weights).
The Transformer is built by stacking blocks composed of self-attention layers followed by fully connected layers. Transformer-based models are the workhorse of many NLP applications. However, as language models grow, so does their computational complexity, and Transformers, too, struggle with long contexts. Humans have a remarkable ability to remember long-term context, which makes their communication more efficient; language models, by contrast, suffer from forgetting context. The amount of computation required grows with the context length, so modelling long-term memories can be inefficient.
However, if a system is allowed to increase its number of parameters while keeping the same computational budget by accommodating large memory layers, it can significantly increase the capacity of the architecture with negligible computational overhead. A couple of years ago, researchers from Facebook introduced memory layers into language models.
In a key-value memory layer, the input is processed through a query network that produces a query vector, which is compared against all the keys. The output is a sparse weighted sum over the memories associated with the selected keys. Only a handful of memory slots are updated for each input at training time; this sparsity of key selection and parameter updates makes training and inference very efficient. With key-value memory layers, the Facebook researchers stated, adding a memory layer is more beneficial than increasing the number of layers. Now, researchers from various institutions in Europe, including DeepMind, have collaborated to develop the ∞-former, or infini-former, an infinite memory transformer that enables unbounded long-term memory. To make this possible, the researchers used a continuous-space attention framework that trades the number of information units that fit into memory (basis functions) against the granularity of their representations.
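The key-value memory idea can be sketched in a few lines of NumPy: a query is scored against every key, only the top-k keys are kept, and the output is a softmax-weighted sum of just those k value slots. This is a minimal illustration of sparse key selection, not Facebook's actual implementation (which uses product keys and learned networks); the names and sizes here are illustrative.

```python
import numpy as np

def memory_layer(x, keys, values, k=4):
    """Sketch of a sparse key-value memory lookup.

    x      : (d,)  query vector produced from the input
    keys   : (N, d) memory keys
    values : (N, d) memory values
    Only the top-k keys are selected, so at most k value
    slots would receive gradient updates per input.
    """
    scores = keys @ x                      # similarity of the query to every key
    top = np.argsort(scores)[-k:]          # indices of the k best-matching keys
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                           # softmax over the selected keys only
    return w @ values[top]                 # sparse weighted sum of memories

rng = np.random.default_rng(0)
d, N = 16, 1024
out = memory_layer(rng.normal(size=d),
                   rng.normal(size=(N, d)),
                   rng.normal(size=(N, d)))
print(out.shape)   # (16,)
```

Because the softmax runs over only k of the N slots, the cost of the lookup (beyond the initial scoring) and of the parameter updates is independent of the memory size N.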
To give the model access to long-range context, the researchers extended the vanilla transformer with a continuous long-term memory (LTM), which stores the input embeddings and hidden states of previous steps. When the memory is represented as a discrete sequence, the researchers wrote, new hidden states must keep being appended to it. This is not feasible for vanilla transformers with long contexts, as the memory requirements keep growing. However, the ∞-former, the researchers claim, can attend to unbounded context without increasing memory requirements by using continuous attention.
The ∞-former can model long contexts thanks to its continuous-space attention framework, whose computational complexity is independent of the context length. This also allows the model to handle long contexts while keeping the computation budget fixed. By updating the memory according to past usage, the model learns to keep “sticky memories”, a procedure that enforces the persistence of important information in long-term memory. To validate the ∞-former’s performance on long contexts, the researchers ran experiments both by training a model from scratch and by fine-tuning a pre-trained language model.
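The continuous-attention idea above can be sketched as follows: compress a long sequence of hidden states into coefficients over a fixed number of basis functions (here, Gaussian RBFs fitted by ridge regression), then attend with a probability density over positions in [0, 1] instead of over discrete tokens. After compression, memory depends on the number of basis functions B, not on the sequence length. This is a hedged sketch under assumed choices (RBF basis, Gaussian attention density, the specific widths); the paper's exact parameterisation differs.

```python
import numpy as np

def rbf(t, centers, width=0.05):
    """Gaussian radial basis functions evaluated at positions t in [0, 1]."""
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

def compress(states, B=64, lam=1e-3):
    """Fit L hidden states (L, d) to B basis coefficients (B, d) via ridge
    regression. Storage after compression depends on B, not on L."""
    L = states.shape[0]
    F = rbf(np.linspace(0, 1, L), np.linspace(0, 1, B))   # (L, B) design matrix
    coef = np.linalg.solve(F.T @ F + lam * np.eye(B), F.T @ states)
    return coef, np.linspace(0, 1, B)

def continuous_attend(coef, centers, mu, sigma=0.05, n=256):
    """Attend with a (discretised) Gaussian density N(mu, sigma^2) over [0, 1]:
    the context vector is the expected reconstructed state under that density."""
    s = np.linspace(0, 1, n)
    p = np.exp(-((s - mu) ** 2) / (2 * sigma ** 2))
    p /= p.sum()                                # density weights over positions
    signal = rbf(s, centers) @ coef             # (n, d) reconstructed states
    return p @ signal                           # (d,) context vector

rng = np.random.default_rng(0)
states = rng.normal(size=(4000, 8))             # a "long" sequence of states
coef, centers = compress(states)                # 4000 states -> 64 coefficients
ctx = continuous_attend(coef, centers, mu=0.9)  # focus near the sequence end
print(coef.shape, ctx.shape)   # (64, 8) (8,)
```

The trade-off the researchers describe is visible here: a fixed B means the representation of each region of the sequence gets coarser as more context is squeezed into the same set of basis functions.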
The researchers designed a token probability distribution that changes over time, to ensure that the long-term memory (LTM) is being effectively used and that the Transformer is not solving the task by merely modelling the most recent tokens. In this sorting task, an input sequence of tokens is sampled according to a token probability distribution unknown to the system. The objective is to generate the tokens in decreasing order of their frequencies in the sequence.
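A minimal sketch of such a synthetic sorting example might look like the following: tokens are sampled from a distribution that drifts over the sequence, and the target is the vocabulary sorted by decreasing frequency. The linear interpolation between two random distributions is a hypothetical choice; the paper's exact drift schedule may differ.

```python
import numpy as np

def sorting_example(vocab=20, length=4000, seed=0):
    """Sample a sequence from a token distribution that changes over time;
    the target is the vocabulary ordered by decreasing frequency.
    The linear drift between two Dirichlet draws is an illustrative choice."""
    rng = np.random.default_rng(seed)
    p0 = rng.dirichlet(np.ones(vocab))          # distribution at the start
    p1 = rng.dirichlet(np.ones(vocab))          # distribution at the end
    seq = np.array([
        rng.choice(vocab, p=(1 - a) * p0 + a * p1)
        for a in np.linspace(0, 1, length)      # distribution drifts over time
    ])
    counts = np.bincount(seq, minlength=vocab)
    target = np.argsort(-counts, kind="stable") # tokens by decreasing frequency
    return seq, target

seq, target = sorting_example()
print(seq.shape, target[:5])
```

Because early tokens influence the final frequency ranking, a model that only attends to the recent part of the sequence cannot recover the target ordering reliably, which is exactly what makes the task a probe of long-term memory.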
For the experiments, the researchers considered a vocabulary of 20 tokens and sequences of length 4,000, 8,000, and 16,000. For all models, they used a transformer with three layers and six attention heads and considered sequences of length 1,024 and a memory size of 2,048. For the compressive transformer, they used memories of size 1,024. The ∞-former likewise has a short-term memory of size 1,024 and an LTM with 1,024 basis functions.
The experiments on the synthetic sorting task show that the ∞-former maintains high accuracy while scaling to long sequences. The researchers concluded that both training models from scratch and fine-tuning a pre-trained language model showed improvements in perplexity. The infini-former addresses the long-standing challenge of unbounded memory for context, which is crucial if tomorrow’s BERTs and GPTs are to help build better chatbots, better web search, or even live translation for diplomats.