Thinking Too Much Can Hurt: Large Language Models Struggle with Inference-Time Reasoning When Humans Do Too
đź“„ Full Paper
đź’¬ Ask
A new study posted on the preprint server arXiv sheds light on a curious limitation of large language models (LLMs): inference-time reasoning, a technique that usually helps them perform better, can hurt them on exactly the kinds of tasks where deliberation also makes humans worse.
The study, titled “Mind Your Step (By Step): Chain-of-Thought Can Reduce Performance on Tasks Where Thinking Makes Humans Worse,” examines six well-studied types of tasks on which humans perform worse after engaging in verbal thinking or deliberation. The authors hypothesized that LLMs would likewise struggle on these tasks when prompted to reason out loud using the chain-of-thought (CoT) prompting technique.
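To make the contrast concrete, here is a minimal sketch of a direct prompt versus a CoT prompt for the kind of artificial-grammar classification task the paper studies. The `query_model` helper, the grammar strings, and the prompt wording are illustrative assumptions, not the paper's actual materials.

```python
# A minimal sketch (not from the paper) contrasting a direct prompt with a
# chain-of-thought prompt for an artificial-grammar classification item.
# `query_model` is a hypothetical stand-in for whichever LLM API you use.

def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError("wire this up to your LLM client of choice")

TRAINING_STRINGS = ["VXVS", "XXVT", "VXXVS"]  # illustrative grammar-conforming strings
TEST_STRING = "XXVS"                          # string to classify

base = (
    "The following strings were generated by a hidden grammar: "
    + ", ".join(TRAINING_STRINGS)
    + f". Does the string '{TEST_STRING}' follow the same grammar?"
)

# Direct (zero-shot) prompt: answer immediately.
direct_prompt = base + " Answer with only 'yes' or 'no'."

# Chain-of-thought prompt: reason step by step before answering.
cot_prompt = base + " Think step by step about the rules, then answer 'yes' or 'no'."

# answer_direct = query_model(direct_prompt)
# answer_cot = query_model(cot_prompt)
```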
The authors found that, indeed, LLMs did worse on tasks involving:
- Implicit statistical learning: When asked to classify strings generated by an artificial grammar, models prompted with CoT performed significantly worse than models that answered directly (a sketch of this kind of with/without-CoT comparison follows the list). This mirrors the human finding that explicitly reasoning about the rules hurts performance on such tasks.
- Visual recognition: Models prompted with CoT struggled to identify matching faces, echoing the verbal overshadowing effect seen in humans.
- Classifying data with exceptions: LLMs that used CoT needed more iterations to learn a simple classification rule with exceptions, similar to human participants who tend to rely on generalizable rules even when faced with exceptions.
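The comparison behind findings like these is simple to sketch: run the same items through the model once with a direct prompt and once with a CoT prompt, then compare accuracy. The snippet below is an assumed illustration of that setup, not the paper's actual evaluation harness; the items, labels, and `query_model` helper are hypothetical.

```python
# Minimal sketch of a direct vs. chain-of-thought accuracy comparison
# (assumed setup, not the paper's harness).

def query_model(prompt: str) -> str:  # hypothetical LLM call (see earlier sketch)
    raise NotImplementedError("wire this up to your LLM client of choice")

ITEMS = [                 # (test string, whether it follows the hidden grammar) -- invented
    ("XXVS", True),
    ("VTSX", False),
]

def accuracy(prompt_template: str) -> float:
    """Score the model on ITEMS using the given prompt template."""
    correct = 0
    for string, label in ITEMS:
        reply = query_model(prompt_template.format(string=string)).lower()
        predicted = "yes" in reply.splitlines()[-1]  # crude parse of the final line
        correct += predicted == label
    return correct / len(ITEMS)

DIRECT = "Does '{string}' follow the grammar? Answer with only 'yes' or 'no'."
COT = "Does '{string}' follow the grammar? Think step by step, then answer 'yes' or 'no'."

# print("direct:", accuracy(DIRECT), "cot:", accuracy(COT))
```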
However, the authors also found that CoT did not harm performance on three other task types where thinking hurts humans, because the constraints driving the human effect do not apply to LLMs in the same way. These include:
- Logical reasoning: LLMs that used CoT actually improved their performance on tasks involving logical inconsistencies, while humans struggle with such tasks after explaining their reasoning.
- Spatial intuition: LLMs performed similarly on tasks involving spatial reasoning, regardless of whether they used CoT, suggesting that they don’t rely on verbal thinking the same way humans do.
- Working memory: LLMs performed better on complex, information-heavy decision tasks when using CoT, likely because, unlike humans, they can hold the full problem in a long context window.
The study emphasizes the need to consider the specific cognitive processes of LLMs when evaluating the effectiveness of CoT prompting. The authors suggest that a deeper understanding of how humans think and make decisions could help identify situations where using CoT might hinder model performance.
This research highlights a fascinating mismatch between human and artificial cognition, and its implications extend beyond LLMs. As these models become more sophisticated, understanding their cognitive limitations, as well as their strengths, will be critical for developing and using them effectively. By considering the role of thinking in both humans and AI, we can pave the way for more robust and reliable AI systems in the future.