Compressed Chain of Thought: Speeding Up Large Language Model Reasoning
Large language models (LLMs) are revolutionizing many fields, but their reasoning abilities often lag behind human capabilities. One promising technique to improve reasoning is “Chain of Thought” (CoT), where the model simulates human thought processes by breaking down complex problems into smaller, more manageable steps. However, CoT comes at a significant cost: drastically increased computation time. A new paper, “Compressed Chain of Thought: Efficient Reasoning through Dense Representations,” tackles this problem head-on.
The core issue with CoT is its reliance on generating lengthy, explicit reasoning chains using discrete language tokens. Each token requires a separate decoding step, dramatically slowing down the process. For example, the authors note that GPT-4 takes almost ten times longer to answer a question using CoT than without it.
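To see why this adds up, a back-of-the-envelope latency model helps; the per-token time and token counts below are made-up illustrative numbers, not measurements from the paper.

```python
# Back-of-the-envelope latency model for autoregressive decoding.
# All numbers are invented for illustration; they are not figures from the paper.

per_token_decode_time = 0.02   # seconds per generated token (assumed)
answer_tokens = 20             # tokens in the final answer (assumed)
reasoning_tokens = 180         # extra tokens in an explicit CoT chain (assumed)

direct_latency = answer_tokens * per_token_decode_time
cot_latency = (reasoning_tokens + answer_tokens) * per_token_decode_time

print(f"direct answer: {direct_latency:.2f}s")  # 0.40s
print(f"with CoT:      {cot_latency:.2f}s")     # 4.00s -- roughly 10x slower
```

Because every reasoning token costs a full decode step, the slowdown scales directly with the length of the written-out chain.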
To address this, the authors introduce “Compressed Chain of Thought” (CCoT). Instead of generating a full reasoning chain token by token, CCoT generates a compressed representation of that chain as a short sequence of continuous embeddings (dense vectors), requiring far fewer decoding steps. These compressed representations capture the essence of the reasoning process without spelling it out word by word.
Think of it like this: a traditional CoT approach might meticulously write out each step of solving a math problem: “First, find the area of the rectangle… Next, calculate the area of the circle… Finally, add the two areas together…” CCoT, on the other hand, would generate a compact numerical summary encoding the same information, greatly reducing the number of tokens that need to be processed.
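As a rough illustration of the contrast, consider what each approach actually has to generate; the token counts, the 4096-dimensional hidden size, and the random values below are invented for illustration and are not the paper's configuration.

```python
import numpy as np

# Explicit CoT: a long sequence of discrete tokens, one decode step each.
explicit_chain = (
    "First, find the area of the rectangle: 4 * 6 = 24. "
    "Next, find the area of the circle: pi * 2**2 = 12.57. "
    "Finally, add the two areas: 24 + 12.57 = 36.57."
).split()
print(len(explicit_chain), "discrete reasoning tokens to decode")

# CCoT: a handful of dense vectors (continuous embeddings) that summarize
# the same reasoning, so only a few decode steps are needed.
hidden_dim = 4096        # assumed hidden size, for illustration only
num_compressed = 6       # assumed number of compressed vectors
compressed_chain = np.random.randn(num_compressed, hidden_dim)
print(compressed_chain.shape)  # (6, 4096): six dense "reasoning" vectors
```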
The authors achieve this compression by training a dedicated module (“CCOT”) that learns to map full reasoning chains (represented as hidden states within the LLM) to their compressed counterparts. The module is trained with teacher forcing: it learns to predict the compressed representations from the hidden states of the full reasoning chain. The resulting compressed tokens are then consumed by a separate decoder module, trained to produce final answers conditioned on both the original question and the compressed reasoning.
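A minimal sketch of this two-stage training is below, assuming a small MLP stands in for the CCOT module and random tensors stand in for real LLM hidden states; in the paper the modules are fine-tuned components of the LLM itself, so every name and dimension here is a placeholder.

```python
import torch
import torch.nn as nn

hidden_dim = 768        # assumed LLM hidden size
chain_len = 120         # length of the full reasoning chain (assumed)
compressed_len = 12     # number of compressed reasoning states (assumed)
vocab_size = 32_000     # assumed vocabulary size

question_states = torch.randn(20, hidden_dim)            # question hidden states
full_chain_states = torch.randn(chain_len, hidden_dim)   # full-chain hidden states

# Stage 1 (CCOT module): predict compressed states from the question context.
# Teacher-forcing targets are a subset of the full chain's hidden states --
# here simply every k-th state, purely for illustration.
stride = chain_len // compressed_len
target_states = full_chain_states[::stride][:compressed_len]

ccot_module = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
    nn.Linear(hidden_dim, compressed_len * hidden_dim),
)
predicted = ccot_module(question_states.mean(dim=0)).view(compressed_len, hidden_dim)
compression_loss = nn.functional.mse_loss(predicted, target_states)

# Stage 2 (DECODE module): produce answer tokens conditioned on the question
# states plus the compressed reasoning states.  A single linear head over a
# pooled context stands in for the real decoder.
decode_head = nn.Linear(hidden_dim, vocab_size)
answer_ids = torch.randint(0, vocab_size, (8,))           # dummy gold answer
context = torch.cat([question_states, predicted.detach()], dim=0)
pooled = context.mean(dim=0).repeat(len(answer_ids), 1)
answer_loss = nn.functional.cross_entropy(decode_head(pooled), answer_ids)

print(float(compression_loss), float(answer_loss))
```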
A key advantage of CCoT is its adaptability. By adjusting a compression ratio, which controls how many compressed tokens are generated relative to the length of the full reasoning chain, researchers can trade accuracy against speed. Generating fewer compressed tokens yields faster responses at the cost of a slight drop in accuracy; generating more improves accuracy but increases computation time. This flexibility allows users to tune the system to their specific needs.
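For intuition, here is a toy calculation of how that knob might play out; the chain length and per-step time are hypothetical, not the paper's measurements, and the ratio r is treated simply as the fraction of the full chain length kept in compressed form.

```python
import math

full_chain_len = 300     # tokens in the explicit reasoning chain (assumed)
per_step_time = 0.02     # seconds per decode step (assumed)

for r in (0.02, 0.05, 0.10, 0.20):
    compressed_len = max(1, math.ceil(r * full_chain_len))
    extra_time = compressed_len * per_step_time
    print(f"r={r:.2f}: {compressed_len:3d} compressed tokens, "
          f"~{extra_time:.2f}s of added decode time")
```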
The authors evaluated CCoT on GSM8K, a benchmark for mathematical reasoning. Their results show that CCoT recovers a meaningful share of the accuracy gains of explicit reasoning at a small fraction of the inference cost: for example, one configuration improves accuracy by 9 points while adding only about 0.4 seconds of processing time. In contrast, a baseline using existing “pause tokens” showed minimal accuracy improvement while requiring similar computation.
CCoT presents a powerful approach to enhancing the reasoning abilities of LLMs without sacrificing speed. The ability to control the compression ratio offers a crucial advantage, providing a flexible trade-off between performance and efficiency that adapts to diverse applications. This work suggests a promising path towards building faster and more capable reasoning systems based on LLMs.