Cohere Unveils SnapKV to Cut Memory & Processing Time in LLMs

SnapKV, a new method, optimises memory use and speeds up data processing, setting a new standard for LLMs


Researchers from Cohere, Princeton University and the University of Illinois have developed a new technique called SnapKV that efficiently compresses the key-value (KV) cache in large language models (LLMs), leading to improvements in memory efficiency and processing speed.

You can read the paper here.

The KV cache plays a crucial role in enabling LLMs to process extensive contexts. However, as input length increases, the growth of the KV cache poses challenges to memory and time efficiency. 

Previous works have attempted to address this issue by evicting entries from the KV cache using various algorithms, such as StreamingLLM, Heavy-Hitter Oracle (H2O), and Adaptive KV Compression (FastGen). 

However, these methods either risk losing important information or focus solely on compressing the KV cache for generated tokens, overlooking compression of the input sequence's KV cache.

SnapKV takes a different approach by intelligently identifying and selecting the most important attention features per head to create a new KV cache. 

The researchers discovered that each attention head in the model consistently focuses on specific prompt attention features during generation, and that this robust pattern can be obtained from an ‘observation’ window located at the end of the prompt.

The SnapKV algorithm works in two steps. First, it scores earlier prompt positions through a voting process: the attention weights that the observation window assigns to those positions are aggregated per head, and the highest-scoring features are selected, then clustered with neighbouring features (via pooling) so that the surrounding context is preserved. Second, the selected features are concatenated with the KV pairs of the observation window itself to form the compressed KV cache, which is saved and used during generation.
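
To make the two steps concrete, here is a minimal PyTorch sketch of this kind of per-head selection for a single layer. The function name, tensor shapes, and the parameters window, k, and pool are illustrative assumptions for this sketch, not the authors' reference implementation.

```python
# A minimal sketch of SnapKV-style KV-cache compression for one layer,
# assuming plain PyTorch tensors. Shapes and defaults are illustrative.
import torch
import torch.nn.functional as F

def snapkv_compress(keys, values, attn_weights, window=32, k=256, pool=7):
    """keys, values:  [num_heads, seq_len, head_dim]
    attn_weights: [num_heads, seq_len, seq_len] softmax attention from prefill
    window:       size of the observation window at the end of the prompt
    k:            number of prefix positions to keep per head
    pool:         kernel size for clustering neighbouring positions
    """
    num_heads, seq_len, head_dim = keys.shape
    prefix_len = seq_len - window

    # Step 1 ("voting"): aggregate each head's observation-window attention
    # over the prefix positions to score how important each position is.
    votes = attn_weights[:, -window:, :prefix_len].sum(dim=1)  # [H, prefix]

    # Cluster with neighbouring positions via 1-D max pooling so that
    # selected features keep their surrounding context.
    votes = F.max_pool1d(votes.unsqueeze(1), kernel_size=pool,
                         stride=1, padding=pool // 2).squeeze(1)

    # Keep the top-k prefix positions per head, restored to original order.
    top_idx = votes.topk(min(k, prefix_len), dim=-1).indices
    top_idx, _ = top_idx.sort(dim=-1)

    idx = top_idx.unsqueeze(-1).expand(-1, -1, head_dim)
    k_sel = keys[:, :prefix_len].gather(1, idx)
    v_sel = values[:, :prefix_len].gather(1, idx)

    # Step 2: concatenate selected prefix features with the observation
    # window's own KV pairs to form the compressed cache.
    k_new = torch.cat([k_sel, keys[:, prefix_len:]], dim=1)
    v_new = torch.cat([v_sel, values[:, prefix_len:]], dim=1)
    return k_new, v_new

# Example: 4 heads, 4096-token prompt, 64-dim heads.
H, S, D = 4, 4096, 64
keys, values = torch.randn(H, S, D), torch.randn(H, S, D)
attn = torch.softmax(torch.randn(H, S, S), dim=-1)
ck, cv = snapkv_compress(keys, values, attn)
print(ck.shape)  # torch.Size([4, 288, 64]): 256 selected + 32 window
```

The max pooling over vote scores is what keeps selected features grouped with their neighbours, reflecting the paper's observation that isolated tokens lose meaning without surrounding context.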

The researchers evaluated SnapKV on various LLMs and long-sequence datasets, confirming that it improves on previous work while remaining comparable in accuracy to conventional KV caching. 

In the Needle-in-a-Haystack test, SnapKV showed a remarkable ability to precisely recall small details in extremely long input contexts, processing up to 380K context tokens on a single GPU. 

The paper states, “Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to baseline when processing inputs of 16K tokens.”

Furthermore, SnapKV was integrated with a leading retrieval-augmented generation (RAG) model, demonstrating that its compression carries over to retrieval-heavy applications. 

The researchers also demonstrated that SnapKV could be combined orthogonally with other acceleration strategies, such as parallel decoding, to further enhance LLM efficiency.

By efficiently compressing the KV caches, this technique opens up new possibilities for the application of LLMs in real-world scenarios involving long context understanding, such as document processing and multi-round conversations.


K L Krithika

K L Krithika is a tech journalist at AIM. Apart from writing tech news, she enjoys reading sci-fi and pondering impossible technologies, trying not to confuse them with reality.