
Can Language Models Have Super Memory?

Snippet: As language models grow in size, their computational complexity increases, and Transformers, too, struggle when dealing with long contexts.


NLP models process language by generating vector-space representations of fixed or variable length. These representations of individual words aggregate information from adjacent words to determine the meaning of a prompt or phrase in the context in which it is used. The Transformer architecture brought this kind of contextual learning to language models: the 2017 seminal work “Attention Is All You Need” by Vaswani et al. created a cascade effect that enabled machines to understand language like never before. Spearheaded by Transformer models, the NLP revolution has become a multi-billion dollar industry today (e.g., BERT powering Google Search). The Transformer is based entirely on attention mechanisms, dispensing with recurrence and convolutions. A Transformer network applies a self-attention mechanism that scans every word and assigns it attention scores (weights).

The Transformer is built by stacking blocks composed of self-attention layers followed by fully connected layers, and Transformer-based models are the workhorse of many NLP applications. However, as language models grow in size, their computational complexity increases, and Transformers struggle when dealing with long contexts. Humans have a remarkable ability to remember long-term context, which makes their communication more efficient; language models, in contrast, suffer from forgetfulness of context. The amount of computation required grows with the context length, so modelling long-term memories can be inefficient.

(Image via the paper by Lample et al.)

But if a system is allowed to increase the number of parameters while keeping the same computational budget, by accommodating large memory layers, it can significantly increase the capacity of the architecture with negligible computational overhead. A couple of years ago, researchers from Facebook introduced memory layers into language models.

In a key-value memory layer, the input is processed through a query network that produces a query vector, which is compared against a set of keys. The output is a sparse weighted sum over the memories (values) associated with the selected keys. Because key selection is sparse, only a handful of memory slots are updated for each input at training time, which makes both training and inference very efficient. With key-value memory layers, the Facebook researchers stated, adding a memory layer is more beneficial than increasing the number of Transformer layers. Now, researchers from various institutions in Europe, including DeepMind, have collaborated to develop the ∞-former (infini-former), an infinite-memory transformer that enables unbounded long-term memory. To make this possible, the researchers used a continuous-space attention framework that trades off the number of information units that fit into memory (basis functions) against the granularity of their representations.
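To make the key-value memory idea concrete, here is a minimal NumPy sketch of a sparse memory lookup of the kind described above. It is illustrative only, not Facebook's actual implementation (which uses product keys for efficiency); the shapes, the single-linear-map query network, and the function name `memory_layer` are assumptions for the sake of the example.

```python
import numpy as np

def memory_layer(x, W_q, keys, values, k=4):
    """Minimal sparse key-value memory lookup (illustrative sketch).

    x:      (d_in,)      input hidden state
    W_q:    (d_k, d_in)  query network (a single linear map here)
    keys:   (n_keys, d_k) learnable memory keys
    values: (n_keys, d_v) learnable memory values
    k:      number of memory slots selected per input
    """
    q = W_q @ x                                # query vector
    scores = keys @ q                          # similarity with every key
    top = np.argpartition(scores, -k)[-k:]     # indices of the k best keys
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                               # softmax over the selected keys only
    return w @ values[top]                     # sparse weighted sum of memories

# Toy usage: only k value rows contribute (and would receive gradients in training).
rng = np.random.default_rng(0)
d_in, d_k, d_v, n_keys = 16, 8, 16, 1024
out = memory_layer(rng.normal(size=d_in),
                   rng.normal(size=(d_k, d_in)),
                   rng.normal(size=(n_keys, d_k)),
                   rng.normal(size=(n_keys, d_v)))
print(out.shape)  # (16,)
```

The sparsity is the point: the memory can hold many slots, but each input only reads from and updates a handful of them, so capacity grows without a matching growth in compute.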

About ∞-former 

∞-former’s attention diagram, via the paper by Martins et al.

To enable the model to access long-range context, the researchers extended the vanilla transformer with a continuous long-term memory (LTM), which stores the input embeddings and hidden states of the previous steps. If the memory were represented as a discrete sequence, the researchers note, every new hidden state would have to be appended to it; this is not feasible for vanilla transformers over long contexts, as the memory requirements keep growing. By using continuous attention instead, the researchers claim, the ∞-former can attend to unbounded context without increasing its memory requirements.
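The following NumPy sketch shows one way past hidden states could be compressed into a fixed-size continuous LTM, assuming Gaussian radial basis functions and a ridge-regression fit; the specific basis width, counts, and the helper names `rbf_basis` and `compress_to_ltm` are illustrative assumptions, not the paper's exact code.

```python
import numpy as np

def rbf_basis(t, centers, width=0.05):
    """Gaussian radial basis functions psi(t), evaluated at positions t in [0, 1]."""
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))  # (len(t), N)

def compress_to_ltm(X, n_basis=256, ridge=1e-3):
    """Fit a continuous signal x_bar(t) = B^T psi(t) to a sequence of hidden states.

    X: (L, d) hidden states of the previous steps.
    Returns B: (N, d) coefficients -- a fixed-size long-term memory whose size
    depends on the number of basis functions N, not on the sequence length L.
    """
    L, d = X.shape
    t = np.linspace(0, 1, L)                   # map token positions to [0, 1]
    centers = np.linspace(0, 1, n_basis)
    F = rbf_basis(t, centers).T                # (N, L)
    # ridge regression: B = (F F^T + lambda I)^{-1} F X
    B = np.linalg.solve(F @ F.T + ridge * np.eye(n_basis), F @ X)
    return B, centers
```

However long the original sequence, the LTM it leaves behind is just the coefficient matrix B, which is what lets the memory stay bounded.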

The ∞-former is able to model long contexts thanks to its continuous-space attention framework, whose computational complexity is independent of the context length. This allows the model to handle long contexts while keeping the computation budget fixed. By updating the memory according to its past usage, the model learns to keep “sticky memories”, a procedure that enforces the persistence of important information in the long-term memory. To validate the performance of the ∞-former on long contexts, the researchers performed experiments both by training a model from scratch and by fine-tuning a pre-trained language model.
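Continuing the sketch above (reusing `rbf_basis` and `compress_to_ltm`), reading from the continuous LTM can be pictured as integrating it against an attention density over [0, 1]. In the sketch below the density is a Gaussian with a hand-picked mean and standard deviation; in the full model a query network would predict these, so the exact values here are placeholders.

```python
def continuous_attention(B, centers, mu, sigma, width=0.05, n_grid=512):
    """Read from the continuous LTM with a Gaussian attention density.

    B:         (N, d) basis coefficients from compress_to_ltm
    mu, sigma: mean and std of the attention density over [0, 1]
               (predicted by a query network in the real model).
    Returns a (d,) context vector; the cost depends on N and the grid size,
    never on the length of the original sequence.
    """
    t = np.linspace(0, 1, n_grid)
    p = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    p /= p.sum()                               # discretised attention density
    psi = rbf_basis(t, centers, width)         # (n_grid, N)
    weights = p @ psi                          # approximates integral of p(t) * psi(t) dt
    return weights @ B                         # (d,) context vector

# Toy usage: compress 4,000 steps into 256 basis functions, then attend.
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 16))
B, centers = compress_to_ltm(X, n_basis=256)
c = continuous_attention(B, centers, mu=0.2, sigma=0.05)
print(c.shape)  # (16,)
```

Whether the model read 4,000 or 16,000 past steps, the attention readout touches only the basis coefficients, which is why the compute budget stays fixed as the context grows.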

The researchers designed a token probability distribution that changes over time, to ensure that the long-term memory (LTM) is being used effectively and that the Transformer is not solving the task by simply modelling the most recent tokens. In the sorting task, the input is a sequence of tokens sampled according to a token probability distribution that the system does not know. The objective is to generate the tokens in decreasing order of their frequencies in the sequence.
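A small sketch of how such a synthetic sorting example could be generated, in the spirit of the task described above; the linear drift between two random distributions is an assumption for illustration, not necessarily the paper's exact schedule.

```python
import numpy as np

def make_sorting_example(vocab_size=20, seq_len=4000, seed=0):
    """Synthetic sorting example: tokens sampled from a drifting distribution.

    Because the distribution changes over time, a model that only looks at
    recent tokens cannot recover the overall frequencies. The target is the
    vocabulary sorted by decreasing frequency in the full sequence.
    """
    rng = np.random.default_rng(seed)
    start = rng.dirichlet(np.ones(vocab_size))   # distribution at the beginning
    end = rng.dirichlet(np.ones(vocab_size))     # distribution at the end
    tokens = np.empty(seq_len, dtype=int)
    for i in range(seq_len):
        alpha = i / (seq_len - 1)                # interpolate the distribution over time
        p = (1 - alpha) * start + alpha * end
        tokens[i] = rng.choice(vocab_size, p=p)
    counts = np.bincount(tokens, minlength=vocab_size)
    target = np.argsort(-counts)                 # tokens in decreasing order of frequency
    return tokens, target

tokens, target = make_sorting_example()
print(tokens[:10], target[:5])
```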

For the experiments, the researchers considered a vocabulary of 20 tokens and sequences of length 4,000, 8,000 and 16,000. For all models, they used a transformer with three layers and six attention heads, with input sequences of length 1,024 and a memory size of 2,048. For the compressive transformer, they used memories of size 1,024. The ∞-former likewise has a short-term memory of size 1,024, plus an LTM with 1,024 basis functions.

The experiments on the synthetic sorting task show that the ∞-former maintains high accuracy while scaling up to long sequences. In both settings, training models from scratch and fine-tuning a pre-trained language model, the researchers reported improvements in perplexity. The ∞-former addresses the long-standing challenge of unbounded memory for context, which is crucial if tomorrow’s BERTs and GPTs are to help build better chatbots, better search results on the web, or even live translation for diplomats.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.