AI Papers Reader

Personalized digests of the latest AI research


Overthinking in Large Language Models: Researchers Find and Fix Inefficiency in 'Chain-of-Thought' Reasoning

Large language models (LLMs) like OpenAI’s “o1” have achieved remarkable results in complex reasoning tasks. Their success stems partly from their ability to mimic human-like “chain-of-thought” (CoT) reasoning, exploring multiple solution paths and double-checking their work. However, a new paper from Tencent AI Lab reveals a significant hidden cost: these models often “overthink,” expending excessive computational resources on simple problems. This overthinking, the researchers argue, is inefficient and can be mitigated without sacrificing accuracy.

The Overthinking Problem

The researchers define “overthinking” as the unnecessary expenditure of computational resources (measured in “tokens,” or units of text generated) on problems that are easily solvable. They illustrate this with the simple equation “2 + 3 = ?”. While a conventional LLM might answer with a concise “5,” o1-like models often generate lengthy, multi-step explanations, sometimes producing multiple solutions to the same problem. Figure 2 in the paper, for example, shows one model generating 13 different solutions for this simple addition, using many times more tokens than a conventional model.
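To make this concrete, here is a minimal sketch that compares the token cost of a terse answer with that of an o1-style verbose answer. The tokenizer (tiktoken’s cl100k_base encoding) and the two example responses are illustrative assumptions, not the paper’s actual setup:

```python
# Minimal sketch: measure the token cost of a concise vs. an "overthinking" answer.
# tiktoken's cl100k_base encoding is used purely for illustration; the paper counts
# tokens with each evaluated model's own tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

concise_answer = "2 + 3 = 5"
verbose_answer = (
    "Let's think step by step. We need to add 2 and 3. "
    "Starting from 2 and counting up three gives 3, 4, 5, so the sum is 5. "
    "Let me double-check with another approach: 3 + 2 is also 5. "
    "Therefore, the answer is 5."
)

for label, text in [("conventional", concise_answer), ("o1-like", verbose_answer)]:
    print(f"{label:>12}: {len(enc.encode(text))} tokens")
```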

This overthinking is not limited to simple arithmetic. The Tencent team ran extensive experiments across reasoning benchmarks of varying difficulty (GSM8K, MATH500, GPQA, and AIME), showing that o1-like models consistently overthink, particularly on easier problems. Their analysis reveals two key inefficiencies: (1) subsequent solutions rarely improve accuracy, and (2) many solutions lack diversity, with models retreading similar reasoning paths.

Measuring and Mitigating Overthinking

To quantify this inefficiency, the researchers introduce two new metrics: “outcome efficiency” and “process efficiency.” Outcome efficiency measures how effectively additional tokens contribute to achieving the correct answer, penalizing solutions that don’t improve accuracy. Process efficiency assesses the diversity of reasoning methods employed, rewarding solutions that offer novel approaches.
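The paper gives formal definitions of both metrics; the sketch below captures the intuition under simplifying assumptions. It assumes each response has already been segmented into an ordered list of solutions, each with a token count, a correctness flag, and a coarse label for its reasoning strategy; the exact formulas in the paper may differ:

```python
# Illustrative sketch of outcome and process efficiency (not the paper's exact formulas).
from dataclasses import dataclass

@dataclass
class Solution:
    tokens: int     # tokens spent on this solution segment
    correct: bool   # does the segment reach the right answer?
    strategy: str   # coarse label of the reasoning approach, used to judge diversity

def outcome_efficiency(solutions: list[Solution]) -> float:
    """Fraction of tokens spent up to and including the first correct solution."""
    total = sum(s.tokens for s in solutions)
    spent = 0
    for s in solutions:
        spent += s.tokens
        if s.correct:
            return spent / total
    return 0.0  # no correct solution: none of the tokens contributed to the outcome

def process_efficiency(solutions: list[Solution]) -> float:
    """Fraction of tokens spent on solutions that introduce a new reasoning strategy."""
    total = sum(s.tokens for s in solutions)
    seen, useful = set(), 0
    for s in solutions:
        if s.strategy not in seen:
            useful += s.tokens
            seen.add(s.strategy)
    return useful / total

# Example: 13 solutions to "2 + 3 = ?", all using the same strategy; only the first is needed.
response = [Solution(40, True, "count-up")] + [Solution(60, True, "count-up")] * 12
print(outcome_efficiency(response), process_efficiency(response))  # both are low
```

On this toy response both metrics are small, because almost all tokens come after the first correct solution and repeat the same strategy, which is exactly the pattern the paper flags as overthinking.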

To address overthinking, the researchers propose several self-training strategies that leverage the model’s own output to create training data. These include length preference optimization, which encourages shorter responses, and response simplification methods such as keeping only the earliest correct solution (“First-Correct Solutions,” or FCS) or combining the first correct solution with a second, reflective one (“FCS+Reflection”).
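As a rough illustration of how such training targets could be derived from a model’s own output, the sketch below shows one way to build FCS and FCS+Reflection responses and a chosen/rejected pair for length preference optimization. The segmentation into solutions and the field names are assumptions for the example, not the paper’s exact pipeline:

```python
# Illustrative sketch of building self-training data from a model's own long responses.
# Each response is assumed to be pre-split into solution segments with a correctness flag.

def first_correct_solution(solutions):
    """'First-Correct Solutions': truncate the response at the earliest correct solution."""
    for i, sol in enumerate(solutions):
        if sol["correct"]:
            return solutions[: i + 1]
    return solutions  # no correct solution found: keep the full response

def fcs_plus_reflection(solutions):
    """'FCS+Reflection': keep the first correct solution plus one more solution,
    so the shortened response still contains a self-check."""
    prefix = first_correct_solution(solutions)
    return prefix + solutions[len(prefix):len(prefix) + 1]

def length_preference_pair(samples):
    """For length preference optimization: prefer the shortest correct sample and
    reject the longest sample for the same problem."""
    correct = [s for s in samples if s["correct"]]
    if not correct:
        return None
    chosen = min(correct, key=lambda s: s["tokens"])
    rejected = max(samples, key=lambda s: s["tokens"])
    return {"chosen": chosen["text"], "rejected": rejected["text"]}

# Example: three solution segments where the first already answers correctly.
response = [
    {"text": "2 + 3 = 5", "correct": True, "tokens": 8},
    {"text": "Double-check: 3 + 2 = 5.", "correct": True, "tokens": 10},
    {"text": "Counting on a number line also gives 5.", "correct": True, "tokens": 16},
]
print(len(fcs_plus_reflection(response)))  # 2: the first correct solution plus one reflection
```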

Experimental Results

The results show that these self-training methods substantially reduce both the number of tokens and the number of solutions generated, yielding clear gains in outcome and process efficiency, with “FCS+Reflection” performing best. Figure 7 in the paper shows a large reduction in tokens generated across difficulty levels while accuracy is maintained. Importantly, the improvements carry over to the more challenging GPQA and AIME benchmarks, suggesting that the proposed methods improve efficiency without sacrificing the models’ ability to tackle complex problems.

Implications and Future Work

This work is a crucial step toward optimizing the resource efficiency of advanced LLMs. The findings highlight the limitations of simply scaling computational resources during inference and propose concrete methods for improving the efficiency of chain-of-thought reasoning. Future work will likely focus on developing more sophisticated and adaptable methods for controlling the computational cost of inference while maintaining the powerful reasoning capabilities of these advanced models.