
LLM Systems Will Soon Have Infinite Context Length

Microsoft, Google, and Meta have all been taking strides in this direction – making context length infinite.


LLMs forget. Everyone knows that. The primary culprit is the models' finite context length. Some even say it is the biggest bottleneck on the path to AGI.

It appears the debate over which model boasts the largest context length will soon become irrelevant. Microsoft, Google, and Meta have all been taking strides in this direction – making context length infinite.

The end of Transformers?

While all current LLMs run on Transformers, the architecture might soon become a thing of the past. For example, Meta has introduced MEGALODON, a neural architecture designed for efficient sequence modelling with unlimited context length.

MEGALODON aims to overcome the limitations of the Transformer architecture, such as its quadratic computational complexity and limited inductive bias for length generalisation. The model demonstrates superior efficiency at a scale of 7 billion parameters and 2 trillion training tokens, outperforming other models such as Llama 2 in terms of training loss.

It introduces key innovations such as the complex exponential moving average (CEMA) component and timestep normalisation layer, which improve long-context pretraining and data efficiency. These improvements enable MEGALODON to excel in various tasks, including instruction fine-tuning, image classification, and auto-regressive language modelling.
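MEGALODON's exact CEMA parameterisation is more involved than can be shown here, but the core idea, an exponential moving average whose decay factor is complex-valued so it can rotate as well as shrink the hidden state, can be sketched in a few lines of Python. The decay, damping, and rotation values below are illustrative placeholders, not trained parameters.

```python
import numpy as np

def complex_ema(x, alpha=0.3, delta=0.9, theta=0.1):
    """Toy complex exponential moving average over a 1-D sequence.

    h_t = alpha * x_t + (1 - alpha * delta) * exp(i * theta) * h_{t-1}
    Only the real part is returned, mimicking how CEMA-style layers
    project complex hidden states back to real activations.
    """
    decay = (1.0 - alpha * delta) * np.exp(1j * theta)  # complex decay factor
    h = 0.0 + 0.0j
    out = np.empty_like(x, dtype=np.float64)
    for t, x_t in enumerate(x):
        h = alpha * x_t + decay * h
        out[t] = h.real
    return out

print(complex_ema(np.ones(8)))
```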

Most likely, Meta's upcoming Llama 3 will be based on the MEGALODON architecture, giving it infinite context length.

Similarly, Google researchers have introduced a method called Infini-Attention, which incorporates compressive memory into the vanilla attention mechanism. The paper, titled 'Leave No Context Behind', explains that the approach combines masked local attention and long-term linear attention mechanisms in a single Transformer block, allowing existing LLMs to handle infinitely long contexts with bounded memory and computation.

The approach scales naturally to million-length input sequences and outperforms baselines on long-context language modelling benchmarks and book summarisation tasks. A 1B model, fine-tuned on passkey instances of up to 5K sequence length, successfully solved the passkey retrieval task at 1M length.
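In rough terms, the compressive memory behaves like a linear-attention store that is read alongside ordinary local attention and updated once per segment. The NumPy sketch below (single head, untrained, fixed gate) is only meant to illustrate that update-and-retrieve pattern as described in the paper; the ELU+1 feature map and the scalar gate `beta` are simplifications of the learned components.

```python
import numpy as np

def elu1(x):
    # ELU(x) + 1, a non-negative feature map commonly used for linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def infini_segment(Q, K, V, M, z, beta=0.5):
    """One segment of a toy Infini-Attention head.

    Q, K, V: (seg_len, d) projections for the current segment
    M:       (d, d) compressive memory carried over from earlier segments
    z:       (d,)   normalisation term for the memory
    """
    # 1) Local causal softmax attention within the segment
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -np.inf)
    local = np.exp(scores - scores.max(axis=-1, keepdims=True))
    local = local / local.sum(axis=-1, keepdims=True)
    A_local = local @ V

    # 2) Retrieve from the compressive memory (linear-attention read)
    sQ = elu1(Q)
    A_mem = (sQ @ M) / (sQ @ z[:, None] + 1e-6)

    # 3) Update the memory with this segment's keys and values
    sK = elu1(K)
    M_new = M + sK.T @ V
    z_new = z + sK.sum(axis=0)

    # 4) Mix the memory read and the local attention with a gate
    out = beta * A_mem + (1.0 - beta) * A_local
    return out, M_new, z_new

d, seg = 16, 8
M, z = np.zeros((d, d)), np.zeros(d)
for _ in range(3):  # stream three segments with bounded memory
    Q, K, V = (np.random.randn(seg, d) for _ in range(3))
    out, M, z = infini_segment(Q, K, V, M, z)
print(out.shape, M.shape)
```

Because the memory `M` keeps a fixed shape no matter how many segments have been streamed, the cost per segment stays constant, which is where the bounded memory and computation come from.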

Forgetting to forget

Along similar lines, another team of researchers from Google introduced Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop so the network can attend to its own latent representations. This fosters the emergence of working memory within the Transformer and allows it to process indefinitely long sequences.

The introduction of FAM offers a new approach by adding feedback activations that feed contextual representation back into each block of sliding window attention. This enables integrated attention, block-wise updates, information compression, and global contextual storage.
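One way to picture the mechanics: a sliding-window attention block whose keys and values are extended with a small set of feedback vectors, and those vectors are rewritten after every block so they carry a compressed summary forward. The Python sketch below is a hypothetical, heavily simplified single-projection version of that loop, not the paper's architecture.

```python
import numpy as np

def fam_block(x_block, fam, W):
    """Toy block of sliding-window attention with a feedback memory.

    x_block: (block_len, d) activations for the current block
    fam:     (fam_len, d)   feedback memory carried from the previous block
    W:       (d, d)         a single shared projection (stand-in for Q/K/V)
    """
    ctx = np.concatenate([fam, x_block], axis=0)          # attend to memory + block
    q, k, v = x_block @ W, ctx @ W, ctx @ W
    att = np.exp(q @ k.T / np.sqrt(q.shape[-1]))
    att = att / att.sum(axis=-1, keepdims=True)
    out = att @ v
    # The memory itself attends to [memory; block] and is rewritten, compressing
    # the block's context into a fixed-size state for the next block.
    fq = fam @ W
    fatt = np.exp(fq @ k.T / np.sqrt(fq.shape[-1]))
    fatt = fatt / fatt.sum(axis=-1, keepdims=True)
    new_fam = fatt @ v
    return out, new_fam

d, block_len, fam_len = 16, 8, 4
W = np.random.randn(d, d) / np.sqrt(d)
fam = np.zeros((fam_len, d))
for _ in range(3):                      # stream three blocks
    x = np.random.randn(block_len, d)
    out, fam = fam_block(x, fam, W)     # fam carries context across blocks
print(out.shape, fam.shape)
```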

Separately, researchers from the Beijing Academy of AI introduced Activation Beacon, a method that extends LLMs' context length by condensing raw activations into compact forms. This plug-in component enables LLMs to perceive long contexts while retaining their performance within shorter contexts.

Activation Beacon uses a sliding window approach for stream processing, enhancing efficiency in training and inference. By training with short-sequence data and varying condensing ratios, Activation Beacon supports different context lengths at a low training cost. Experiments validate Activation Beacon as an effective, efficient, and low-cost solution for extending LLMs’ context length.
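A rough picture of the scheme: each window of raw key/value activations is shrunk by the condensing ratio before being cached, so the history that later windows attend to stays compact. The sketch below substitutes simple mean-pooling for the learned beacon tokens, so it should be read as an illustration of the bookkeeping rather than of the method itself.

```python
import numpy as np

def condense_window(K, V, ratio):
    """Toy stand-in for Activation Beacon's condensing step.

    The real method inserts learned beacon tokens that attend to the raw
    activations; here we simply mean-pool keys/values in groups of `ratio`
    to show how a window of activations shrinks by the condensing ratio.
    """
    L, d = K.shape
    n = L // ratio
    Kc = K[: n * ratio].reshape(n, ratio, d).mean(axis=1)
    Vc = V[: n * ratio].reshape(n, ratio, d).mean(axis=1)
    return Kc, Vc

def stream(windows_kv, ratio=4):
    """Process a long stream window by window, keeping only condensed history."""
    history_K, history_V = [], []
    for K, V in windows_kv:                   # each item is one window's (K, V)
        # ...the current window would attend to [condensed history] + [raw window]...
        Kc, Vc = condense_window(K, V, ratio) # shrink this window before caching it
        history_K.append(Kc)
        history_V.append(Vc)
    return np.concatenate(history_K), np.concatenate(history_V)

d = 16
windows = [(np.random.randn(8, d), np.random.randn(8, d)) for _ in range(4)]
K_hist, V_hist = stream(windows)
print(K_hist.shape)   # (8, 16): 32 raw positions condensed 4x
```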

Do we even need tokens?

In February, Microsoft Research published the paper titled ‘LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens’. The technique significantly increases the context length of LLMs to an unprecedented 2048k tokens, while preserving their original performance within shorter context windows.
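LongRoPE's central trick is non-uniform interpolation of the rotary position embedding (RoPE): each frequency band gets its own rescaling factor, found by search, so positions far beyond the original training window still map into ranges the model has seen. A minimal NumPy sketch of that idea follows, with made-up rescaling factors standing in for the searched ones.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, rescale=None):
    """Rotary position embedding angles, optionally rescaled per frequency band.

    `rescale` is a hypothetical stand-in for LongRoPE's searched, per-band
    interpolation factors; None gives plain RoPE.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    if rescale is not None:
        inv_freq = inv_freq / rescale                          # non-uniform interpolation
    return np.outer(positions, inv_freq)                       # (seq, dim/2)

def apply_rope(x, angles):
    """Rotate pairs of channels of x by the given angles (standard RoPE)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

seq, dim = 16, 8
x = np.random.randn(seq, dim)
# Uniform 4x interpolation vs. an invented non-uniform schedule per band
uniform = rope_angles(np.arange(seq), dim, rescale=np.full(dim // 2, 4.0))
nonuniform = rope_angles(np.arange(seq), dim, rescale=np.array([1.0, 2.0, 4.0, 8.0]))
print(apply_rope(x, uniform).shape, apply_rope(x, nonuniform).shape)
```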

Moving beyond that, another team of Microsoft researchers has challenged the traditional approach to LLM pre-training, which uniformly applies a next-token prediction loss to all tokens in a training corpus. Instead, they propose a new language model called RHO-1, which utilises Selective Language Modeling (SLM).

SLM addresses this by working at the token level and excluding undesired tokens from the loss during pre-training.

SLM first trains a reference language model on high-quality corpora to establish utility metrics for scoring tokens according to the desired distribution. Tokens with a high excess loss between the reference and training models are selected for training, focusing the language model on those that best benefit downstream applications.
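A minimal PyTorch sketch of that selection step might look as follows; the keep ratio and the plain top-k cutoff are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def slm_loss(train_logits, ref_logits, targets, keep_ratio=0.6):
    """Toy Selective Language Modeling loss.

    train_logits, ref_logits: (seq, vocab) next-token logits from the model
    being trained and from a frozen reference model; targets: (seq,) token ids.
    Tokens whose training loss most exceeds the reference loss are kept; the
    rest are excluded from the objective.
    """
    train_ce = F.cross_entropy(train_logits, targets, reduction="none")  # (seq,)
    with torch.no_grad():
        ref_ce = F.cross_entropy(ref_logits, targets, reduction="none")
        excess = train_ce - ref_ce                       # excess loss per token
        k = max(1, int(keep_ratio * targets.numel()))
        selected = excess.topk(k).indices                # highest-excess tokens
    return train_ce[selected].mean()

seq, vocab = 32, 100
loss = slm_loss(torch.randn(seq, vocab), torch.randn(seq, vocab),
                torch.randint(0, vocab, (seq,)))
print(loss)
```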

No more ‘lost in the middle’?

There has been a long-running conversation about how models with longer context windows get 'lost in the middle'. Even with the advent of long-context LLMs, opting for smaller context-length inputs is recommended for accuracy. Notably, facts at the beginning and end of the input are retained better than those in the middle.

Jim Fan from NVIDIA AI explains how claims of a million or billion tokens are not helpful when it comes to improving LLMs. “What truly matters is how well the model actually uses the context. It’s easy to make seemingly wild claims, but much harder to solve real problems better,” he said. 

Meanwhile, to measure the efficiency of these longer context lengths, NVIDIA researchers developed RULER, a synthetic benchmark designed to evaluate long-context language models across various task categories, including retrieval, multi-hop tracing, aggregation, and question answering.
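Retrieval tasks of this kind are typically built synthetically, for instance by burying a 'passkey' somewhere in filler text and checking whether the model can recall it at the end. A toy generator in Python, in the spirit of such tests (the filler sentence and lengths are arbitrary):

```python
import random

def make_passkey_example(context_sentences=50, passkey_digits=5, seed=0):
    """Build a toy passkey-retrieval prompt: hide a random key inside filler
    text and ask the model to retrieve it."""
    rng = random.Random(seed)
    passkey = "".join(str(rng.randint(0, 9)) for _ in range(passkey_digits))
    sentences = ["The grass is green", "The sky is blue"] * context_sentences
    insert_at = rng.randint(0, len(sentences) - 1)   # bury the key mid-context
    sentences.insert(insert_at, f"The pass key is {passkey}")
    prompt = ". ".join(sentences) + ".\nWhat is the pass key?"
    return prompt, passkey

prompt, answer = make_passkey_example()
print(answer, answer in prompt)
```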

All of this suggests that future LLM systems will indeed have infinite context length.


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.