
LogQuant: New Method Shrinks Large Language Model Memory Footprint While Boosting Accuracy

Researchers at the National University of Singapore and 4Paradigm have introduced LogQuant, a novel technique to significantly reduce the memory footprint of Large Language Models (LLMs) during inference, all while surprisingly improving accuracy in certain complex tasks. The team’s findings, detailed in a recently published paper, address a critical challenge in deploying increasingly large and computationally intensive LLMs: the growing demand for memory.

The core of LogQuant lies in how it quantizes the Key/Value (KV) cache. The KV cache is the component of an LLM that stores information about past tokens (words) in a sequence, allowing the model to "remember" context and generate coherent text. As an LLM processes longer sequences, the KV cache grows linearly with sequence length, consuming significant memory.
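To make the scale concrete, here is a back-of-the-envelope sketch (not from the paper) of KV-cache memory growth. The model configuration is an assumption, roughly matching a Llama-2-7B-style model (32 layers, 32 KV heads, head dimension 128) stored in fp16; the "2-bit" column ignores the small overhead of quantization metadata.

```python
# Illustrative only: model dimensions are assumed, not taken from the paper.

def kv_cache_bytes(seq_len: int, batch: int = 1, layers: int = 32,
                   kv_heads: int = 32, head_dim: int = 128,
                   bytes_per_value: float = 2.0) -> float:
    """Memory for keys *and* values (hence the factor of 2) across all layers."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_value

for seq_len in (1_024, 8_192, 32_768):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:5.1f} GiB at fp16, "
          f"{gib / 8:5.2f} GiB at 2-bit (ignoring quantization metadata)")
```

At 32K tokens the fp16 cache for this assumed configuration already reaches roughly 16 GiB per sequence, which is why compressing it matters.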

LogQuant addresses this issue by selectively compressing the KV cache with a novel 2-bit quantization technique, meaning that each stored value is represented using only 2 bits instead of the usual 16 or more. However, not all tokens are created equal. The method uses a clever "log-based filtering" approach: recent tokens (those closer to the token currently being generated) tend to be more important than those further back in the sequence history, and the researchers found that "attention spikes," which mark important tokens, follow a log distribution with respect to their distance from the most recent token. So, instead of treating all tokens the same, LogQuant prioritizes preserving the precision of tokens selected by this log-spaced rule, particularly recent ones, while aggressively compressing older, less relevant ones.
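The sketch below illustrates the general idea under stated assumptions; it is not the authors' implementation. It picks full-precision positions whose distance from the newest token grows geometrically (a log-spaced rule), and applies a simple per-tensor 2-bit min/max quantization to everything else. The selection rule, the quantization granularity, and the function names here are illustrative choices; a real implementation would quantize per channel or group and pack four 2-bit values into each byte.

```python
import torch

def log_spaced_keep_positions(seq_len: int, n_keep: int) -> torch.Tensor:
    """Positions to keep at full precision: distances from the newest token grow
    geometrically, so recent context is sampled densely and distant context sparsely.
    Illustrative only; the paper's exact selection rule may differ."""
    max_exp = torch.log2(torch.tensor(float(seq_len))).item()
    dists = torch.logspace(0, max_exp, steps=n_keep, base=2.0).long()
    positions = (seq_len - 1 - dists).clamp(min=0)
    return torch.unique(positions)

def quantize_2bit(x: torch.Tensor):
    """Asymmetric min/max quantization to 4 levels (2 bits), per tensor for simplicity."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / 3          # 4 levels -> 3 intervals
    q = torch.round((x - lo) / scale).clamp(0, 3).to(torch.uint8)
    return q, scale, lo

def dequantize_2bit(q: torch.Tensor, scale: torch.Tensor, lo: torch.Tensor) -> torch.Tensor:
    return q.float() * scale + lo

# Toy example: a cache of 1024 past key vectors with dimension 128.
keys = torch.randn(1024, 128)
keep = log_spaced_keep_positions(seq_len=1024, n_keep=64)
mask = torch.zeros(1024, dtype=torch.bool)
mask[keep] = True

q, scale, lo = quantize_2bit(keys[~mask])          # aggressively compress the rest
restored = keys.clone()
restored[~mask] = dequantize_2bit(q, scale, lo)    # dequantize on use

print(f"kept {int(mask.sum())} of 1024 positions at full precision")
print("mean reconstruction error on compressed positions:",
      (restored[~mask] - keys[~mask]).abs().mean().item())
```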

“Imagine you’re summarizing a long document,” explains Dr. Han Chen, lead author of the study. “You’re more likely to refer back to the last few sentences you read than something from the very beginning. LogQuant mimics this process.”

The results are compelling. In benchmark tests, LogQuant increased throughput by 25% and boosted batch size (the number of sequences processed simultaneously) by 60% without increasing memory consumption. Even more impressive, on challenging tasks such as math and code completion, LogQuant improved accuracy by a staggering 40% to 200% over existing compression techniques at the same compression ratio. Those existing methods struggle to identify which tokens matter: "window-based" strategies risk discarding distant but important tokens, while "attention-based" approaches suffer from prediction errors when relying on historical attention scores.

“For tasks like solving complex math problems or generating functional code, retaining the correct contextual information is crucial,” says Dr. Chen. “LogQuant’s ability to preserve this information, even while drastically reducing memory, leads to these significant accuracy gains.”

The researchers also highlight how easily LogQuant integrates with popular inference frameworks such as Hugging Face's Python transformers library, making it readily accessible to developers looking to optimize their LLM deployments. The implementation is available on GitHub.
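As a rough illustration of what drop-in integration can look like, recent versions of transformers let callers pass a custom cache object to generate() via past_key_values, which is the natural hook for a compressed KV cache. The LogQuantCache class, its module path, and its constructor arguments below are assumptions for the sake of the sketch; consult the project's GitHub repository for the real API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from logquant import LogQuantCache   # hypothetical module and class name; see the repo

model_id = "Qwen/Qwen2.5-0.5B-Instruct"   # any causal LM supported by transformers
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tok = AutoTokenizer.from_pretrained(model_id)

inputs = tok("Prove that the sum of the first n odd numbers is n^2.",
             return_tensors="pt").to(model.device)

# Recent transformers releases accept a custom cache object through
# `past_key_values`, so a compressed KV cache can be swapped in here.
cache = LogQuantCache(bits=2)             # hypothetical constructor and argument
output = model.generate(**inputs, past_key_values=cache, max_new_tokens=256)
print(tok.decode(output[0], skip_special_tokens=True))
```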

LogQuant marks a significant step forward in making LLMs more accessible and efficient, potentially unlocking new possibilities for deployment in resource-constrained environments and enabling more sophisticated applications requiring longer context windows.