Sometime in the recent past, AI research stopped obsessing over model size and set its eyes on something called context size. The model size debate has been settled for now – smaller LLMs trained on much more data have proven to be better than anything else that we know of. What does context size do then, and why has it suddenly become so important?
Why is context length important?
Well, the interest in context length isn’t necessarily sudden. Since the transformer architecture became popular, a small section of research has worked on increasing the sequence length to improve the accuracy of a model’s responses. But since LLMs like ChatGPT are now on the verge of being integrated into enterprises, the matter of improving these tools has become far more pressing.
If a model is able to take an entire conversation into consideration, it has clearer context and can generate a more meaningful and relevant response – this is essentially a long context strategy. On the other hand, if a model loads only the part of a conversation that is essential to finish a task, it is following a short context strategy.
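A short context strategy can be sketched in a few lines: keep only the most recent messages that fit within a token budget. This is a minimal illustration, not any vendor's actual implementation; the function name and the word-count stand-in for a tokeniser are both hypothetical.

```python
def truncate_history(messages, budget, count_tokens=lambda m: len(m.split())):
    # Walk the conversation from newest to oldest, keeping messages
    # until the budget is exhausted. count_tokens here just counts
    # whitespace-separated words as a crude stand-in for a tokeniser.
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

With a budget of 4 "tokens", `truncate_history(["a b", "c d e", "f"], 4)` drops the oldest message and returns `["c d e", "f"]` – the model simply never sees the earlier turns.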
GPT’s context length limitation
For all the magical things that OpenAI’s models can do, ChatGPT was limited to a context length of 4,096 tokens. This limit was pushed to 32,768 tokens only for a limited-release full-fat version of the seminal GPT-4. Translated into words, that works out to roughly 3,000 words. In other words, if you were to cross this limit while asking a query, the model would simply lose the thread and start hallucinating.
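The token-to-word conversion relies on a common rule of thumb: for English text, one token is roughly three-quarters of a word. The helper below is a hypothetical back-of-envelope estimator, not a real tokeniser (actual counts vary with the model's vocabulary).

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb: ~0.75 words per token for English prose,
    # so 3,000 words comes out near GPT-3.5's 4,096-token limit.
    words = len(text.split())
    return int(words / 0.75)

print(estimate_tokens("word " * 3000))  # -> 4000, close to the 4,096 cap
```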
For instance, when asked to do a spell check on a chunk of 2,000 words, ChatGPT was able to process between 800 and 900 words. After this, it paused and started hallucinating – offering unrelated questions of its own and answering them of its own accord.
But as demands to solve the context length problem flood in, some have partially figured out how to go about it.
OpenAI rival Anthropic has opened up the context window massively with its own chatbot Claude, pushing it to around 75,000 words, or 100,000 tokens. As a blog posted by the startup stated, that’s enough to process the entire text of The Great Gatsby in one go. Claude was able to demonstrate this – when one sentence in the novel was edited, the chatbot spotted the change in 22 seconds.
A couple of days back, Salesforce announced the release of a family of open-source LLMs called CodeT5+, which it said was contextually richer since it wasn’t built on the GPT-style of design.
The blog posted by Salesforce made things clearer by placing the blame squarely on the imperfections of autoregressive models. “For instance, decoder-only models such as GPT-based LLMs do not perform well in understanding tasks such as defect detection and code retrieval. Quite often, the models require major changes in their architectures or additional tuning to suit downstream applications.”
Instead, Salesforce designed a flexible encoder-decoder architecture which was more scalable and could “mitigate the pretrain-finetune discrepancy.”
Solving the context length problem
Five days back, Meta AI’s research team released a paper titled, ‘MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers,’ that proposed a new method to address the context length problem. “Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books,” it stated.
MEGABYTE, a new multiscale decoder architecture, enables end-to-end differentiable modelling of sequences of more than one million bytes. The model segments sequences into separate patches, then uses a local submodel within each patch and a global model between them.
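The benefit of patching can be seen with a back-of-envelope calculation. Self-attention cost grows quadratically with sequence length; splitting a sequence of N positions into patches of size P means the global model attends over only N/P patch positions, while each local model attends over just P positions. The numbers below are an illustrative cost sketch, not the paper's exact FLOP accounting.

```python
def attention_cost(n):
    # Vanilla self-attention: every position attends to every other.
    return n * n

def megabyte_cost(n, patch):
    # Rough patched cost: a global model over n/patch patch positions,
    # plus a local quadratic model inside each patch.
    num_patches = n // patch
    global_cost = num_patches * num_patches
    local_cost = num_patches * patch * patch
    return global_cost + local_cost

n, p = 1_000_000, 8
print(attention_cost(n) / megabyte_cost(n, p))  # roughly a 64x reduction
```

For a million-byte sequence with patches of 8 bytes, the attention-style cost drops by around 64x under this crude model – which is why the same budget can buy a far bigger global model.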
The main advantage that this architecture holds over self-attention transformers is cost. MEGABYTE was able to reduce the cost by a fair bit by “allowing far bigger and more expressive models at the same cost by using huge feedforward layers per patch rather than per position”.
The giant cost of processing long contexts in transformers raises the big question of whether the money is even worth it. Even Anthropic’s Claude, which can process 100,000 tokens, is likely to be costly. For example, a full 32k-context prompt to GPT-4 costs around USD 1.96, which is steep considering these tools aim to be used for all kinds of general-purpose tasks across organisations.
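The USD 1.96 figure follows from GPT-4's list price at the time of roughly USD 0.06 per 1,000 prompt tokens for the 32k model (a then-current figure; API prices change):

```python
price_per_1k_usd = 0.06   # GPT-4 32k prompt pricing at launch
context_tokens = 32_768   # the full 32k context window

prompt_cost = context_tokens / 1000 * price_per_1k_usd
print(f"USD {prompt_cost:.2f}")  # prints "USD 1.97" (~1.96 before rounding)
```

Note this counts input tokens only; completion tokens were billed at a higher rate, so a real round trip would cost more.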
For a chatbot that seeks to be as intelligent as a human, context is everything. Without it, a chatbot with the memory of a goldfish won’t amount to much more than what it is now.