
Busting the Myth of Context Length

In the push for making chatbots as smart as humans, we are definitely also making them as dumb as humans



Now that smaller models such as LLaMA and Falcon are, in certain cases, performing on par with GPT-4 or PaLM, the conversation has shifted from increasing the number of parameters to increasing the number of context tokens, or context length, that these models support.

In essence, context length is the number of tokens an LLM can take in at once. To respond to a prompt well, the model needs clarity on the entire context that the question has been put in.

Often, people have this notion that the longer the input, the better the output. But, in reality, that is not the case. Feed a 2,000-word article into ChatGPT, and it makes sense of it until around the 700-800 word mark, then starts hallucinating.

This is pretty much how short-term memory works in humans. But is it really the case that context length is all that matters?

Attention is indeed all you need

Take listening to a story or watching a movie, for example. In most cases, the introduction and the ending are what the audience remembers most, while the part in the middle often has the least recall value. Jim Fan of NVIDIA AI, who holds a Stanford PhD, explains that this is exactly what LLMs are going through.

In his tweet, drawing on the recent paper from Stanford researchers, Lost in the Middle: How Language Models Use Long Contexts, Fan explains why claims of a million or a billion tokens are not helpful when it comes to improving LLMs. “What truly matters is how well the model actually uses the context. It’s easy to make seemingly wild claims, but much harder to solve real problems better,” he said.

The paper shows that models are good at retaining information present at the beginning and the end of the context, but not in the middle. This holds across the LLMs currently being developed, including GPT, PaLM, and Flan-T5.

Moreover, models with a natively longer context do not actually use that context better. In the paper, the researchers show that both versions of GPT-3.5, one with a 4k and the other with a 16k context length, produce similar results, and that performance decreases as the context grows longer.
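The paper’s multi-document question-answering setup is easy to mimic in spirit: place the one answer-bearing document at different positions among distractors, then score the model’s answers per position. The sketch below is illustrative only; the documents, the question, and the commented-out query_llm call are placeholders, not the researchers’ actual code or data.

```python
# Toy "lost in the middle" probe: build the same prompt with the gold
# (answer-bearing) document placed at different positions among
# distractor documents, then compare accuracy per position.

def build_prompt(gold_doc: str, distractors: list[str], position: int, question: str) -> str:
    """Insert the gold document at `position` among the distractors and
    return a numbered-document prompt ending with the question."""
    docs = distractors[:position] + [gold_doc] + distractors[position:]
    numbered = "\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    return f"{numbered}\n\nQuestion: {question}\nAnswer:"

gold = "The keycode for the vault is 4417."
noise = [f"Filler passage {i} about an unrelated topic." for i in range(9)]
question = "What is the keycode for the vault?"

# One prompt per candidate position: start, middle, and end of the context.
prompts = {pos: build_prompt(gold, noise, pos, question) for pos in (0, 5, 9)}

for pos, prompt in prompts.items():
    # answer = query_llm(prompt)  # hypothetical API call; score answers per position
    print(f"gold at position {pos}: {prompt.count('Document')} documents in context")
```

Plotting accuracy against the gold document’s position is what produces the paper’s characteristic U-shaped curve: high at the edges, low in the middle.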

Ahmed Moubtahij of the Computer Research Institute of Montreal adds that this may stem from the training data. Most of these models are trained on internet pages such as news articles, which put the most important information at the beginning and the end, and the outputs of LLMs end up mirroring the same pattern.

Stupidity like humans

Ever since the Transformer was introduced in the Attention is All You Need paper, context length has been discussed excessively in every LLM release. It has long been believed that increasing the sequence length would improve the accuracy of the models. But, just as humans forget half the story midway, LLMs are showcasing a similar capability, or possibly inability.

One thing that is certain: in the push to make chatbots as smart as humans, we have definitely managed to make them as dumb as humans. Maybe that is all we need, even if we don’t want it. The similarity between human brains and Transformers is astonishing.

In discussions on HackerNews, Reddit, and Twitter on the same topic, users shared how increasing the number of tokens is becoming laughable at this point. As one user put it: “I’ve noticed this with GPT-4. It’ll ignore some part of its context, and when I point it out, it knows, so it’s clearly still in its context, but it didn’t know it has to look it up for a particular answer. We also have the same problem with memory, so I empathise.”

Moreover, if LLM providers are charging by the token through their APIs, increasing the number of context tokens simply earns them more money. More research is needed to show whether adding more context tokens actually makes sense.

The sheer token costs of Transformers make one question whether the money will eventually even be worth it. Anthropic’s Claude, which has the largest context window at 100k tokens, will likely be very costly; for comparison, GPT-4’s 32k model costs USD 0.06 per 1,000 input tokens, which works out to nearly USD 2 to fill the context in a single prompt.
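The back-of-the-envelope arithmetic is simple enough to write down. The per-1,000-token figures below are OpenAI’s publicly listed input prices for GPT-4 at the time (USD 0.03 for the 8k model, USD 0.06 for the 32k model); output tokens cost extra and are left out of this sketch.

```python
# Cost of filling an entire context window in a single prompt,
# given an input price in USD per 1,000 tokens.

def full_context_cost(context_tokens: int, usd_per_1k_tokens: float) -> float:
    return context_tokens / 1000 * usd_per_1k_tokens

# GPT-4 list prices for input tokens: 8k at $0.03/1k, 32k at $0.06/1k.
print(full_context_cost(8_000, 0.03))   # ~0.24 USD per full 8k prompt
print(full_context_cost(32_000, 0.06))  # ~1.92 USD per full 32k prompt
```

At those rates, a model that actually ignores the middle of its context is charging for tokens it effectively never reads.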

For now, LLMs, like us humans, have a curious habit of remembering the story’s beginning and end with flair while casually dismissing the messy middle part. These models exhibit a common tendency — the longer the context, the higher the likelihood of their stumbling. It’s almost as if they suffer from a case of “attention deficit context disorder”.


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.