The Context Length Hitch with GPT Models


Sometime in the recent past, AI research stopped obsessing over model size and set its sights on something called context size. The model-size debate has been settled for now: smaller LLMs trained on much more data have proven better than anything else we know of. What does context size do, then, and why has it suddenly become so important? 

Why is context length important? 

Well, the interest in context length isn’t necessarily sudden. Ever since the transformer architecture became popular, a small section of research has worked on increasing the sequence length to improve the accuracy of a model’s responses. But now that LLMs like ChatGPT are on the verge of being integrated into enterprises, the matter of improving these tools has become far more pressing. 

If a model can take an entire conversation into consideration, it has clearer context and can generate a more meaningful and relevant response. This is essentially a long-context strategy. If, on the other hand, a model loads only the part of a conversation that is essential to finish a task, it has a short-context strategy. 
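A short-context strategy can be sketched as a simple truncation policy: walk backwards through the conversation and keep only the newest messages that fit within a fixed token budget. This is a minimal illustration, not any vendor's actual implementation; the tokens-per-word ratio and the budget are rough assumptions, not the behaviour of a real tokenizer.

```python
# A minimal "short context" sketch: keep only the most recent messages
# that fit within a fixed token budget. The tokens-per-word ratio is a
# rough heuristic, not an actual tokenizer.

TOKENS_PER_WORD = 1.33  # assumed heuristic: ~0.75 English words per token

def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on word count."""
    return int(len(text.split()) * TOKENS_PER_WORD)

def truncate_history(messages: list[str], budget: int = 4096) -> list[str]:
    """Walk backwards through the conversation, keeping the newest
    messages until the token budget is exhausted."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

A long-context strategy, by contrast, simply raises the budget high enough that truncation rarely triggers, which is exactly what the larger context windows discussed below make possible.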

GPT’s context length limitation

For all the magical things that OpenAI’s models can do, ChatGPT was limited to a context length of 4,096 tokens. This limit was pushed to 32,768 tokens only for a limited-release version of GPT-4. In terms of words, 4,096 tokens works out to roughly 3,000 words. Cross that limit while asking a query, and the model loses track of the conversation and starts hallucinating. 
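The token-to-word conversions in this piece follow the commonly cited heuristic of roughly 0.75 English words per token. That ratio is an approximation (actual tokenization depends on the text), but it reproduces the figures quoted here:

```python
# Back-of-envelope check of the word limits quoted in this article,
# assuming the common heuristic of ~0.75 English words per token.
WORDS_PER_TOKEN = 0.75  # assumed heuristic, not an exact tokenizer property

for tokens in (4_096, 32_768, 100_000):
    print(f"{tokens} tokens ~ {int(tokens * WORDS_PER_TOKEN)} words")
```

By this estimate, 4,096 tokens is about 3,000 words, 32,768 tokens is about 24,500 words, and Claude's 100,000 tokens is about 75,000 words.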

For instance, when asked to spell-check a chunk of 2,000 words, ChatGPT processed only about 800-900 words. After this, it paused and started hallucinating, posing unrelated questions and answering them of its own accord. 

But as demand to solve the context length problem grows, some have partially figured out how to go about it. 

OpenAI rival Anthropic has opened up the context window massively with its own chatbot Claude, pushing it to around 75,000 words, or 100,000 tokens. As a blog post by the startup stated, that is enough to process the entire text of The Great Gatsby in one go. Claude demonstrated as much: when a single sentence in the novel was edited, it spotted the change in 22 seconds. 

A couple of days back, Salesforce announced the release of a family of open-source LLMs called CodeT5+, which it said was contextually richer since it was not built on the GPT-style decoder-only design. 

The blog posted by Salesforce made things clearer by placing the blame squarely on the imperfections of autoregressive models. “For instance, decoder-only models such as GPT-based LLMs do not perform well in understanding tasks such as defect detection and code retrieval. Quite often, the models require major changes in their architectures or additional tuning to suit downstream applications.” 

Instead, Salesforce designed a flexible encoder-decoder architecture which was more scalable and could “mitigate the pretrain-finetune discrepancy.” 

Solving the context length problem

Five days back, Meta AI’s research team released a paper titled ‘MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers’ that proposed a new method to address the context length problem. “Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books,” it stated. 

MEGABYTE, a new multiscale decoder architecture, enables end-to-end differentiable modelling of sequences of more than one million bytes. The model segments sequences into separate patches, then uses a local submodel within these patches and a global model between them. 
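The segmentation step can be illustrated in a few lines. The sketch below only shows how a byte sequence is cut into fixed-size patches; in the actual MEGABYTE model a local transformer runs within each patch and a global transformer runs across patch representations. The patch size here is an arbitrary assumption for illustration.

```python
# Illustrative sketch of MEGABYTE-style patching: split a byte sequence
# into fixed-size patches. Only the segmentation step is shown; the real
# model pairs a local model (within patches) with a global model (across
# patches). Patch size is an assumed value.

def to_patches(data: bytes, patch_size: int = 8) -> list[bytes]:
    """Split a byte sequence into consecutive patches, padding the last
    patch with zero bytes so every patch has the same length."""
    pad = (-len(data)) % patch_size
    padded = data + b"\x00" * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]
```

Because the expensive global model attends over patches rather than individual bytes, the effective sequence length it sees shrinks by a factor of the patch size.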

A key advantage this architecture holds over standard self-attention transformers is cost. MEGABYTE reduces compute cost considerably by “allowing far bigger and more expressive models at the same cost by using huge feedforward layers per patch rather than per position”. 

The giant cost of processing long token sequences in transformers raises the question of whether the money is eventually even worth it. Even Anthropic’s Claude, which can process 100,000 tokens, will likely be costly. For example, filling GPT-4’s 32k context window costs around USD 1.96 in prompt tokens alone, which is steep considering these tools aim to be used for all kinds of general-purpose tasks across organisations. 
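The figure quoted above follows from per-token pricing. The calculation below assumes GPT-4 32k's published prompt-token price at the time of writing (USD 0.06 per 1,000 tokens); that price is an assumption that changes over time, and completion tokens cost extra.

```python
# Cost arithmetic for a full context window, assuming GPT-4 32k's
# prompt-token price at the time of writing. Pricing changes over time,
# so treat the constant below as an assumption.

USD_PER_1K_PROMPT_TOKENS = 0.06  # assumed GPT-4 32k prompt price

def prompt_cost(tokens: int) -> float:
    """USD cost of sending `tokens` prompt tokens, rounded to cents."""
    return round(tokens / 1000 * USD_PER_1K_PROMPT_TOKENS, 2)
```

At this rate, a full 32,768-token prompt comes to just under USD 2, and a 100,000-token prompt would cost several times that; the cost scales linearly with every query that fills the window.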

For a chatbot that seeks to be as intelligent as a human, context is everything. Without it, a chatbot with the memory of a goldfish won’t amount to much more than what it is now. 

Poulomi Chatterjee
Poulomi is a Technology Journalist with Analytics India Magazine. Her fascination with tech and eagerness to dive into new areas led her to the dynamic world of AI and data analytics.
