Researchers from MIT and the Chinese University of Hong Kong have come up with LongLoRA, a fine-tuning method that extends the context capacity of large pre-trained language models without requiring excessive computational resources. Training LLMs with extended context sizes is typically costly in time and GPU usage: because self-attention scales quadratically with sequence length, training a model with an 8192-token context demands roughly 16 times the compute of a 2048-token context. Context length is the number of tokens a Large Language Model (LLM) can take in at once; a longer window lets the model keep the entire prompt in view when generating a response.
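The 16× figure follows directly from the quadratic cost of self-attention. A minimal sketch of the arithmetic (the function name is ours, for illustration):

```python
# Self-attention cost grows quadratically with sequence length,
# so extending the context multiplies training cost sharply.
def attention_cost_ratio(new_len: int, base_len: int) -> float:
    """Relative cost of quadratic attention at new_len vs. base_len."""
    return (new_len / base_len) ** 2

# Extending from a 2048- to an 8192-token context: 4x the tokens,
# but 16x the attention compute.
print(attention_cost_ratio(8192, 2048))  # -> 16.0
```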
Read the full paper here.
The researchers accelerated context extension through two key approaches. First, they employed sparse local attention during fine-tuning, specifically shift short attention (S2-Attn). This enables efficient context extension with substantial computational savings, while maintaining performance similar to fine-tuning with standard dense attention.
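The core idea behind S2-Attn is to split tokens into groups for local attention, and to shift the tokens in half of the attention heads by half a group so that information can flow between adjacent groups. A minimal NumPy sketch of that shift-and-group step, under our reading of the method (this is an illustration, not the authors' code):

```python
import numpy as np

# Sketch of the S2-Attn token-shifting step: tokens are split into groups
# for local attention, and half the heads are shifted by half a group size
# so information can flow between neighbouring groups.
def shift_and_group(x: np.ndarray, group_size: int) -> np.ndarray:
    """x: (batch, seq_len, heads, dim) -> (batch * num_groups, group_size, heads, dim)."""
    b, n, h, d = x.shape
    shifted = x.copy()
    # Shift the second half of the heads by half the group size.
    shifted[:, :, h // 2:] = np.roll(x[:, :, h // 2:], -group_size // 2, axis=1)
    # Split the sequence into groups; attention is then computed per group.
    return shifted.reshape(b * (n // group_size), group_size, h, d)

x = np.zeros((1, 8, 4, 2))
print(shift_and_group(x, group_size=4).shape)  # -> (2, 4, 4, 2)
```

Because the grouped attention is a plain reshape of the shifted tensor, it composes with standard attention kernels, which is one reason the method adds little training overhead.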
Second, the researchers re-examined the parameter-efficient fine-tuning strategy for context expansion. Their findings suggested that LoRA was effective for context extension only when combined with trainable embedding and normalization layers. LongLoRA delivered robust empirical results across a range of tasks using LLaMA2 models from 7B/13B to 70B: it could extend LLaMA2 7B's context from 4k to 100k tokens, or LLaMA2 70B's to 32k, all achievable on a single 8× A100 machine. Importantly, LongLoRA maintained the original model architectures and was compatible with various existing techniques, such as FlashAttention-2.
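For readers unfamiliar with LoRA: a frozen weight matrix W is augmented with a trainable low-rank update B·A, so only r·(d_in + d_out) parameters are trained instead of d_in·d_out. A minimal NumPy sketch of one such layer (an illustration, not the paper's code; LongLoRA's specific finding is that this alone is insufficient for context extension unless the embedding and normalization layers are also unfrozen):

```python
import numpy as np

# Minimal LoRA layer: frozen weight w plus a trainable low-rank update b @ a.
class LoRALinear:
    def __init__(self, w: np.ndarray, rank: int, alpha: float = 1.0):
        d_out, d_in = w.shape
        self.w = w                                    # frozen pre-trained weight
        self.a = np.random.randn(rank, d_in) * 0.01   # trainable down-projection
        self.b = np.zeros((d_out, rank))              # trainable up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T

# With b zero-initialised, the adapted layer starts identical to the original.
layer = LoRALinear(np.eye(4), rank=2)
x = np.ones((1, 4))
print(np.allclose(layer(x), x))  # -> True
```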
To enhance the practicality of LongLoRA, the team created the LongQA dataset for supervised fine-tuning, consisting of over 3,000 question-answer pairs with lengthy contexts.
Long-sequence Language Modeling: The study evaluated the fine-tuned models on the Proof-pile and PG19 datasets and found that models trained with longer contexts achieved better perplexity, showing the effectiveness of the fine-tuning method. In simpler terms, training with more context led to better results. For example, when the training context window grew from 8192 to 32768 tokens, one model's perplexity improved from 2.72 to 2.50 (lower is better).
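Perplexity, the metric behind the 2.72 and 2.50 figures above, is simply the exponential of the mean per-token cross-entropy loss, so lower values mean the model is less "surprised" by the evaluation text. A quick sketch:

```python
import math

# Perplexity = exp(mean token-level cross-entropy); lower is better.
def perplexity(mean_cross_entropy: float) -> float:
    return math.exp(mean_cross_entropy)

# A mean loss of 1 nat per token gives a perplexity of e.
print(perplexity(1.0))  # -> 2.718281828459045
```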
Maximum Context Length: The study also explored how much context these models could handle on a single machine. The models extended to extremely long contexts still performed well, although there was some perplexity degradation at smaller context sizes.
Retrieval-based Evaluation: In addition to language modeling, the study tested the models on a task where they had to find specific topics in very long conversations. Their models performed similarly to the state-of-the-art model in this task, even outperforming it in some cases. Notably, their models were adapted more efficiently to open-source data compared to the competition.
How does Context Length Matter?
In recent discussions about language models like LLaMA and Falcon, which can match larger models such as GPT-4 or PaLM in specific cases, the focus has shifted from increasing the number of model parameters to increasing the context length, i.e. the number of tokens in the context window.
AIM reported earlier that, contrary to the misconception that longer input text always leads to better output, models like ChatGPT given a lengthy article (e.g., 2000 words) tend to make sense of the content up to around 700-800 words before starting to generate less coherent responses. This phenomenon is similar to how human memory works: the beginning and end of information are better retained than the middle.
Read more: Busting the Myth of Context Length