New Finetuning Method LongLoRA Paves the Way for Budget-Friendly Super-sized LLMs

The method increases the context capacity of large pre-trained language models without requiring excessive computational resources. 

Researchers from MIT and the Chinese University of Hong Kong have come up with LongLoRA, a fine-tuning method that increases the context capacity of large pre-trained language models without requiring excessive computational resources. Training LLMs with extended context sizes is typically costly in terms of time and GPU usage. For example, because the cost of self-attention grows quadratically with sequence length, training a model with an 8192-length context demands 16 times the computational resources in the self-attention layers compared to a 2048-length context. Context length refers to the maximum number of tokens a Large Language Model (LLM) can take into account at once. It matters because responding effectively to a prompt requires a clear understanding of the entire context in which the question is posed.
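The 16x figure follows from the quadratic scaling of self-attention with sequence length. A minimal sketch of the arithmetic, assuming cost is proportional to the square of the number of tokens and ignoring layers whose cost grows only linearly (the function name is illustrative, not from the paper):

```python
def attention_cost_ratio(long_context: int, short_context: int) -> float:
    """Relative self-attention cost, assuming cost grows as (sequence length) squared."""
    return (long_context / short_context) ** 2


ratio = attention_cost_ratio(8192, 2048)
print(ratio)  # 16.0
```

A context that is 4 times longer therefore costs 4 x 4 = 16 times more in the attention layers under this assumption.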

Read the full paper here.

Training Method

The researchers sped up context extension in two ways. First, they employed sparse local attention, specifically the shift short attention (S2-Attn) approach, during fine-tuning. This extends the context efficiently, yielding substantial computational savings while maintaining performance similar to fine-tuning with standard attention.
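As described in the LongLoRA paper, S2-Attn splits the sequence into groups and computes attention only inside each group, while half of the attention heads see the tokens shifted by half a group so that information can flow between neighbouring groups. The following is a simplified NumPy sketch of that idea, not the authors' implementation: it omits the causal mask, the batch dimension and any handling of the wrap-around between the last and first tokens, and all function names are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def group_attention(q, k, v, group_size):
    """Attention computed independently inside consecutive groups of tokens.

    q, k, v have shape (N, H, D): tokens, heads, head dimension.
    """
    n, h, d = q.shape
    g = group_size

    def split(x):
        # (N, H, D) -> (N/G, G, H, D) -> (N/G, H, G, D)
        return x.reshape(n // g, g, h, d).transpose(0, 2, 1, 3)

    qg, kg, vg = split(q), split(k), split(v)
    scores = qg @ kg.transpose(0, 1, 3, 2) / np.sqrt(d)  # (N/G, H, G, G)
    out = softmax(scores) @ vg                           # (N/G, H, G, D)
    return out.transpose(0, 2, 1, 3).reshape(n, h, d)


def shifted_group_attention(q, k, v, group_size):
    """Group attention where the second half of the heads is shifted by G/2."""
    n, h, d = q.shape
    assert n % group_size == 0 and h % 2 == 0
    half, shift = h // 2, group_size // 2

    def shift_half(x, s):
        # Roll only the second half of the heads along the token axis.
        x = x.copy()
        x[:, half:] = np.roll(x[:, half:], s, axis=0)
        return x

    q, k, v = [shift_half(t, -shift) for t in (q, k, v)]
    out = group_attention(q, k, v, group_size)
    # Roll the shifted heads back so outputs line up with the original tokens.
    return shift_half(out, shift)


rng = np.random.default_rng(0)
N, H, D, G = 16, 4, 8, 4
q, k, v = rng.normal(size=(3, N, H, D))
out = shifted_group_attention(q, k, v, G)
print(out.shape)  # (16, 4, 8)
```

Each group attends over G tokens instead of N, so the attention cost in this sketch scales with N x G rather than N x N.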

Second, the researchers re-examined the parameter-efficient fine-tuning strategy for context expansion. Their findings suggested that LoRA was effective for context extension when combined with trainable embeddings and normalization. LongLoRA delivered robust empirical results across a range of tasks using LLaMA2 models, spanning from 7B/13B to 70B. LongLoRA could extend a model’s context from 4k to 100k for LLaMA2 7B, or to 32k for LLaMA2 70B, all achievable on a single 8× A100 machine. Importantly, LongLoRA maintained the original model architectures and was compatible with various existing techniques, such as FlashAttention-2.
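One way to picture "LoRA combined with trainable embeddings and normalization" is as a rule for which parameters stay trainable while the rest of the model is frozen. The sketch below is only an illustration of that rule: the parameter names are hypothetical, loosely modeled on LLaMA-style checkpoints, and the keyword matching is an assumption rather than the authors' code.

```python
def is_trainable(param_name: str) -> bool:
    """Return True for parameters left trainable in this sketch."""
    # LoRA adapter weights, embedding layers and normalization layers.
    trainable_keywords = ("lora_", "embed", "norm")
    return any(key in param_name for key in trainable_keywords)


# Hypothetical parameter names, used only for illustration.
param_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.lora_A.weight",
    "model.layers.0.self_attn.q_proj.lora_B.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.norm.weight",
]

trainable = [name for name in param_names if is_trainable(name)]
frozen = [name for name in param_names if not is_trainable(name)]

print("trainable:", trainable)
print("frozen:", frozen)
```

In this sketch the large projection matrices stay frozen, while the low-rank adapters, the embeddings and the normalization weights are updated.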

To enhance the practicality of LongLoRA, the team created the LongQA dataset for supervised fine-tuning, consisting of over 3,000 question-answer pairs with lengthy contexts. 

Key Findings

Long-sequence Language Modeling: The study evaluated different models on the Proof-pile and PG19 datasets. It found that models fine-tuned with longer context sizes performed better, showing the effectiveness of the fine-tuning method. In simpler terms, letting a model see more context during training led to better results. For example, when the context window size increased from 8192 to 32768, one model’s perplexity improved from 2.72 to 2.50 (lower is better).
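Perplexity, the metric quoted above, is the exponential of the average negative log-likelihood the model assigns to each token, so lower values mean the model is less "surprised" by the text. A minimal sketch with made-up token probabilities (the values are illustrative and not taken from the study):

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.5 to every token has perplexity close to 2.
log_probs = [math.log(0.5)] * 4
ppl = perplexity(log_probs)
print(round(ppl, 6))
```

Under this definition, a drop from 2.72 to 2.50 means the model assigns higher probability, on average, to each next token.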

Maximum Context Length: The study also explored how far the context could be extended on a single machine. The models were extended to handle extremely long contexts and still performed well, although there was some drop in performance at smaller context sizes.

Retrieval-based Evaluation: In addition to language modeling, the study tested the models on a task where they had to find specific topics in very long conversations. The models performed similarly to the state-of-the-art model on this task, even outperforming it in some cases. Notably, they were adapted more efficiently than the competition, using open-source data.

How does Context Length Matter?

In recent discussions about language models like LLaMA and Falcon, which can perform similarly to larger models such as GPT-4 or PaLM in specific cases, the focus has shifted from increasing the number of model parameters to considering the number of context tokens or context length.

AIM reported earlier that, contrary to the misconception that longer input text leads to better output, models like ChatGPT given a lengthy article (e.g., 2000 words) tend to make sense of the content up to around 700-800 words before starting to generate less coherent responses. This phenomenon is similar to how human memory works, with the beginning and end of information being better retained than the middle.

Read more: Busting the Myth of Context Length

Shritama Saha
Shritama Saha is a technology journalist who is keen to learn about AI and analytics. A graduate in mass communication, she is passionate about exploring the influence of data science on fashion, drug development, films, and art.
