The Cost of Confidence: Why "Thinking Out Loud" is Essential for AI Math
In the race to make Large Language Models (LLMs) faster and more efficient, researchers have leaned heavily on a technique called “self-distillation,” in which a model acts as its own teacher. When the teacher copy is given the correct answer as a “hint,” it can produce a concise, near-perfect reasoning path; the student copy is then trained to mimic that directness.
However, a new paper from researchers at Microsoft and Seoul National University reveals a startling side effect: this process can make models significantly dumber at complex reasoning. In some mathematical benchmarks, the researchers observed performance plunges of up to 40%.
The Power of “Hmm”
The core of the problem lies in the suppression of what the authors call “epistemic verbalization”—the AI equivalent of thinking out loud.
When advanced models like DeepSeek-R1 tackle a difficult problem, they don’t just output a solution. They frequently use “uncertainty markers” such as “Wait,” “Hmm,” “Perhaps,” or “Let me check.” While these tokens might look like “filler” or wasted computation, the researchers argue they are actually vital signals. They represent the model’s internal process of weighing different hypotheses and catching its own mistakes.
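The idea is easy to see in miniature. The sketch below counts epistemic phrases in a reasoning trace; the marker list is an illustrative assumption, not the paper's exact set, and the two traces are toy examples.

```python
import re

# Illustrative "uncertainty markers" -- an assumed subset, not the
# paper's exact list of epistemic phrases.
UNCERTAINTY_MARKERS = ["wait", "hmm", "perhaps", "let me check", "alternatively"]

def count_uncertainty_markers(trace: str) -> int:
    """Count occurrences of epistemic phrases in a reasoning trace."""
    text = trace.lower()
    return sum(len(re.findall(re.escape(m), text)) for m in UNCERTAINTY_MARKERS)

# An exploratory trace pauses and double-checks; a distilled one does not.
exploratory = "Hmm, maybe x = 2? Wait, let me check: 2 * 3 = 6, so yes."
distilled = "x = 6 / 3 = 2."

print(count_uncertainty_markers(exploratory))  # 3
print(count_uncertainty_markers(distilled))    # 0
```

A metric this crude obviously misses paraphrased hedging, but it is enough to show what "suppression of epistemic verbalization" would look like in training data: the distilled trace scores zero.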
Think of it like a mountain climber. A novice might see an expert fly up a familiar wall and try to mimic their speed. But the expert is only moving fast because they know exactly where the holds are. If the novice tries that same speed on a new, crumbling cliffside without testing the rocks—pausing to “verbalize” their uncertainty—they are likely to fall.
The Teacher’s Trap
In self-distillation, the “teacher” model is given the ground-truth solution. Because it knows the destination, it stops second-guessing itself. It stops saying “Wait” or “Hmm” and produces a short, confident path to the answer.
When the student model is trained to copy this style, it learns to be concise and confident. But at test time, when the ground-truth solution is gone, the student is stuck with a “confident” style and no internal mechanism for handling doubt. It commits to the first path it finds, even if that path is wrong, because it has been trained to treat “thinking out loud” as a flaw to be eliminated.
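The structure of this trap can be sketched as a toy data pipeline. Everything here is a stand-in: `teacher_generate` represents sampling from a hinted model, not real LLM inference, and the function names are hypothetical.

```python
def teacher_generate(problem: str, ground_truth: str) -> str:
    # With the answer supplied as a hint, the teacher skips exploration
    # and emits a short, confident trace.
    return f"Apply the standard step. Answer: {ground_truth}"

def build_distillation_set(problems):
    # Each student training pair keeps only the confident teacher trace.
    # The hint itself is dropped, so the student never sees *why* the
    # teacher could afford to be so direct.
    return [(p, teacher_generate(p, ans)) for p, ans in problems]

data = build_distillation_set([("Solve 3x = 6 for x.", "x = 2")])
print(data[0][1])  # "Apply the standard step. Answer: x = 2"
```

The asymmetry is the whole problem: the teacher's confidence is conditioned on information (the hint) that is absent at test time, yet the student inherits the confidence without the condition.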
Why Math is Different
The study highlights a fascinating divide between domains. In subjects like Chemistry, self-distillation actually works quite well. This is because many chemistry problems in the training sets are repetitive or follow highly similar patterns. In those cases, the model doesn’t need to “think”—it just needs to remember the formula.
Mathematical reasoning, however, is “Out-of-Distribution” (OOD) by nature. A competition-level math problem often requires a novel combination of rules. Across models like Qwen3-8B and Olmo3-7B-Instruct, the researchers found that as the models were forced to be more concise through self-distillation, their ability to solve these novel problems withered.
Beyond Correctness
The takeaway for the AI industry is clear: training a model only on “correct” answer traces is not enough. To build robust reasoners, we must preserve the model’s right to be unsure. The study suggests that “redundant” thinking isn’t just a stylistic quirk—it is the very foundation of an AI’s ability to navigate the unknown. Efficiency, it seems, is the enemy of exploration.