
xKV: Novel Compression Technique Shrinks Memory Footprint of Large Language Models Without Sacrificing Accuracy

Large Language Models (LLMs) are powerful, but their massive size demands significant memory resources, especially for storing the Key and Value states (KV-Cache) during long-context inference. Now, researchers have unveiled xKV, a new post-training method that dramatically reduces the KV-Cache size of LLMs without compromising accuracy, addressing a critical bottleneck in deploying these models.

The key insight behind xKV lies in the observation that the dominant singular vectors – think of them as the most important directions – within the KV-Cache of different layers of an LLM are remarkably aligned, even though the individual token representations may differ significantly. Imagine that each layer of the LLM “sees” the data from a slightly different angle, while the main patterns, the overall gist, stay consistent across these viewpoints. xKV exploits this alignment.
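
For intuition, here is a minimal sketch (not the authors' code) of how that alignment can be checked: take the top singular vectors of two layers' Key caches and measure the cosines of the principal angles between the subspaces they span; values near 1.0 mean the dominant directions are closely aligned. The function names, shapes, and random placeholder data below are purely illustrative.

```python
# Illustrative sketch only: how aligned are the dominant singular directions
# of two layers' Key caches? Shapes and data are hypothetical placeholders.
import numpy as np

def top_right_singular_vectors(cache, rank):
    # cache: (num_tokens, head_dim) matrix of cached Key (or Value) states
    _, _, vt = np.linalg.svd(cache, full_matrices=False)
    return vt[:rank]                      # (rank, head_dim), orthonormal rows

def subspace_alignment(cache_a, cache_b, rank=8):
    va = top_right_singular_vectors(cache_a, rank)
    vb = top_right_singular_vectors(cache_b, rank)
    # Singular values of va @ vb.T are the cosines of the principal angles
    # between the two rank-r subspaces; values near 1.0 mean "aligned".
    return np.linalg.svd(va @ vb.T, compute_uv=False)

# Hypothetical caches from two layers: 4096 cached tokens, 128-dim heads.
layer_i = np.random.randn(4096, 128)
layer_j = np.random.randn(4096, 128)
print(subspace_alignment(layer_i, layer_j))
```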

xKV works by applying Singular Value Decomposition (SVD) – a mathematical technique for dimensionality reduction – across groups of layers within the LLM. SVD identifies the shared, low-rank subspace that captures the essence of the KV-Cache across these layers. Instead of storing separate KV-Caches for each layer, xKV consolidates them into a shared, much smaller representation. It’s akin to creating a concise summary that captures the core information from multiple detailed documents, saving storage space without losing the crucial details.
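
As a rough sketch of that idea – assuming each layer's cache is a tokens-by-head-dimension matrix, with the rank, group size, and helper names below chosen for illustration rather than taken from the authors' implementation – the caches of a layer group can be stacked, decomposed once with SVD, and stored as a single shared basis plus small per-layer coefficients:

```python
# A minimal sketch of cross-layer SVD sharing, assuming per-layer caches of
# shape (num_tokens, head_dim). Rank and group size are illustrative choices,
# not the authors' implementation.
import numpy as np

def compress_layer_group(caches, rank):
    """Consolidate the caches of a group of layers into one shared low-rank basis."""
    stacked = np.concatenate(caches, axis=0)          # (num_layers * T, d)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    basis = vt[:rank]                                 # shared (rank, d) subspace
    coeffs = [c @ basis.T for c in caches]            # (T, rank) per layer
    return basis, coeffs

def reconstruct(basis, coeffs):
    # Approximate each layer's original cache from the shared basis.
    return [c @ basis for c in coeffs]

# Hypothetical group of 4 layers, 2048 cached tokens each, 128-dim heads.
group = [np.random.randn(2048, 128) for _ in range(4)]
basis, coeffs = compress_layer_group(group, rank=32)
approx_caches = reconstruct(basis, coeffs)
```

In this toy setting, four full caches (4 × 2048 × 128 values) shrink to one 32 × 128 basis plus four 2048 × 32 coefficient matrices, roughly a 4x reduction in stored values for this hypothetical configuration.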

The researchers tested xKV on the RULER benchmark, using widely-used LLMs like Llama-3.1 and Qwen2.5. The results were impressive: xKV achieved up to 6.8x higher compression rates compared to existing inter-layer compression techniques. Notably, the technique even improved accuracy by 2.7% in some cases.

To illustrate, consider a scenario where an LLM needs to analyze a long document. Traditionally, each layer of the model would maintain its own KV-Cache, leading to substantial memory consumption. With xKV, several layers can share a compressed KV-Cache, significantly reducing the overall memory footprint.
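
To put rough numbers on that picture, here is a back-of-the-envelope estimate with hypothetical settings (the model size, sequence length, and compression rate below are placeholders, not figures from the paper):

```python
# Back-of-the-envelope KV-Cache memory estimate for a long-context prompt.
# All settings and the compression rate are hypothetical placeholders.
num_layers   = 32
num_kv_heads = 8
head_dim     = 128
seq_len      = 128_000        # tokens in the long document
bytes_fp16   = 2

per_layer_bytes = 2 * num_kv_heads * head_dim * seq_len * bytes_fp16  # Key + Value
baseline_gb = num_layers * per_layer_bytes / 1e9

compression_rate = 4.0        # illustrative; depends on rank and layer grouping
compressed_gb = baseline_gb / compression_rate

print(f"per-layer KV-Cache:  {per_layer_bytes / 1e9:.2f} GB")
print(f"full-model baseline: {baseline_gb:.1f} GB")
print(f"with shared caches:  {compressed_gb:.1f} GB")
```

With these placeholder numbers, the baseline cache is about 16.8 GB and the shared version about 4.2 GB.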

Further demonstrating its versatility, xKV also works with emerging Multi-Head Latent Attention (MLA) architectures, such as DeepSeek-Coder-V2, where it achieves a 3x compression rate on coding tasks without any performance degradation. MLA already employs techniques to reduce KV-Cache size, but xKV still adds further compression gains on top.

“The ability to drastically reduce memory bottlenecks opens the door for wider adoption of long-context LLMs,” explains Chi-Chih Chang, a lead author of the study. “Applications that were previously limited by memory constraints, such as processing massive codebases or performing extensive information retrieval, become more feasible with xKV.”

The code for xKV is publicly available, allowing other researchers and developers to readily implement and experiment with this promising compression technique. This paves the way for building even more efficient and accessible LLMs in the future.