Longer "Thinking" Doesn't Necessarily Improve Reasoning Models on Fact-Heavy Tasks
In the pursuit of more intelligent AI, researchers have explored "test-time scaling": essentially asking large language models to "think harder" by generating longer reasoning chains before providing an answer. This technique has shown promise in various domains. However, a new study reveals that for knowledge-intensive tasks, where factual accuracy is paramount, the approach may not be as effective as hoped and, in some cases, can even lead to more errors.
The research, published as a preprint by James Xu Zhao and colleagues, investigated the impact of increasing inference-time computation on 12 different reasoning models across two knowledge-intensive benchmarks. The core question: does more thinking time translate into better factual recall and fewer hallucinations (making up incorrect information)?
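In practice, "thinking harder" usually means raising a reasoning-token budget or effort setting at inference time. As a rough illustration of that knob (not the authors' setup), the sketch below sweeps a thinking budget using the google-genai Python SDK; the model name, budget values, and the question are assumptions, and other providers expose comparable settings.

```python
from google import genai
from google.genai import types

# Illustrative sweep of a "thinking budget" for a single factual question.
# Assumes the google-genai SDK with an API key set in the environment; the
# model name, budgets, and question are placeholders, not from the paper.
client = genai.Client()

question = "In which year did Marie Curie receive her second Nobel Prize?"

for budget in (0, 512, 2048, 8192):  # 0 disables thinking on supported models
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=question,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=budget)
        ),
    )
    print(f"thinking_budget={budget}: {response.text}")
```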
The findings challenge the common assumption that more computation always leads to better results. Across most of the models tested, extending the reasoning process did not consistently improve accuracy. In fact, for several models, including GPT-5 mini and Gemini 2.5 Flash, longer reasoning actually led to an increase in hallucinations.
How More Thinking Can Lead to More Errors
The study delves into why this counterintuitive outcome occurs. The researchers found that changes in hallucination behavior are largely driven by shifts in a model's willingness to attempt an answer at all, rather than by changes in how accurately it recalls facts.
- Fewer Hallucinations through Abstention: For some models, like Grok-3 mini, the drop in hallucinations with more thinking came primarily from the model becoming more likely to say "I don't know" when unsure. This reflects a more cautious approach: abstaining from answering rather than recalling facts more accurately. For example, if a model is asked about a historical event and, with more thinking, realizes it is uncertain, it might choose to abstain rather than guess.
- More Hallucinations from Increased Attempts: Conversely, when hallucinations increased, it was often because the models, given more processing time, attempted questions they previously wouldn't have. For models like gpt-oss-20b, a significant portion of these new hallucinations stemmed from questions the model had originally abstained from answering (a simple decomposition of this effect is sketched below).
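To make that decomposition concrete, here is a minimal Python sketch (illustrative only, not the paper's evaluation code; the GradedResponse record and summarize helper are hypothetical). It splits graded responses into correct, hallucinated, and abstained, and shows why a higher attempt rate alone can push the hallucination rate up even when recall quality is unchanged.

```python
from dataclasses import dataclass

@dataclass
class GradedResponse:
    """One graded response to a factual question (hypothetical record type)."""
    correct: bool     # answer attempted and matches the reference
    abstained: bool   # model declined to answer, e.g. said "I don't know"

def summarize(responses: list[GradedResponse]) -> dict[str, float]:
    """Decompose outcomes into accuracy, hallucination, and abstention rates.

    All three rates are fractions of the full question set, so they sum to 1.
    """
    n = len(responses)
    attempted = [r for r in responses if not r.abstained]
    correct = sum(r.correct for r in attempted)
    hallucinated = len(attempted) - correct  # attempted but wrong
    return {
        "accuracy": correct / n,
        "hallucination_rate": hallucinated / n,
        "abstention_rate": (n - len(attempted)) / n,
        # Error rate among attempted answers: if this stays flat while the
        # attempt rate rises with longer thinking, the hallucination rate
        # climbs purely because the model answers more questions.
        "error_rate_when_attempting": hallucinated / max(len(attempted), 1),
    }
```

Comparing these rates at a short and a long thinking budget makes the paper's point visible: the hallucination rate can rise even when the error rate among attempted answers does not.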
Confirmation Bias and Overconfidence
A particularly concerning observation from case studies is the potential for extended reasoning to induce “confirmation bias.” This is where a model, after tentatively forming a belief, searches for or even fabricates supporting details, reinforcing its initial, potentially incorrect, assumption. For instance, a model might initially guess an award year for a professor, then proceed to “find” evidence to support that specific year, even if it’s wrong, leading to an overconfident, incorrect answer.
Is Thinking Still Beneficial?
Despite these limitations, the research does indicate that enabling a "thinking" mode in these models, compared to a purely "non-thinking" mode, can still be beneficial. In a separate evaluation, models that natively support both modes showed improved accuracy, particularly on tasks requiring multi-hop reasoning, and generally reduced hallucinations. This suggests that while simply increasing computation might not be the key, the reasoning process itself, when managed effectively, can still offer advantages.
In essence, while test-time scaling has been a powerful tool for improving AI capabilities, its application to knowledge-intensive tasks requires careful consideration. The study concludes that simply giving models more time to “think” isn’t a guaranteed path to factual accuracy and could, in some scenarios, lead them down a path of increased errors and overconfidence.