Thinking Too Much Can Hurt: Large Language Models Struggle with Inference-Time Reasoning When Humans Do Too

A new study posted on the preprint server arXiv sheds light on a curious limitation of large language models (LLMs): inference-time reasoning, a technique that usually improves their performance, can hurt them on exactly the kinds of tasks where deliberate thinking makes humans worse.

The study, titled “Mind Your Step (By Step): Chain-of-Thought Can Reduce Performance on Tasks Where Thinking Makes Humans Worse,” examines six well-studied types of tasks on which humans perform worse after engaging in verbal thinking or deliberation. The authors hypothesized that LLMs would also struggle on these tasks when prompted to reason out loud using the chain-of-thought (CoT) prompting technique.
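To make the comparison concrete, here is a minimal sketch of the two prompting conditions being contrasted, assuming a generic `query_model` helper rather than any particular API; the paper's actual prompts and evaluation harness are not reproduced here.

```python
# Minimal sketch of a zero-shot vs. chain-of-thought comparison.
# `query_model` is a hypothetical callable standing in for whichever LLM API is used.

def build_prompts(question: str) -> dict:
    """Build the two prompt variants: answer directly vs. reason step by step."""
    zero_shot = f"{question}\nRespond with only the final answer."
    chain_of_thought = (
        f"{question}\nLet's think step by step, then state the final answer."
    )
    return {"zero_shot": zero_shot, "cot": chain_of_thought}


def run_both_conditions(questions, query_model):
    """Collect model outputs under both prompting conditions for later scoring."""
    results = {"zero_shot": [], "cot": []}
    for question in questions:
        for condition, prompt in build_prompts(question).items():
            results[condition].append(query_model(prompt))
    return results
```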

The authors found that, indeed, CoT prompting reduced LLM performance on tasks involving implicit statistical learning, visual recognition, and classifying with patterns that contain exceptions.

However, the authors also found that CoT did not harm performance on a further set of tasks where thinking impairs humans, because the cognitive constraints that explain human underperformance on those tasks do not carry over to LLMs.
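For readers curious how such an effect is typically quantified, the sketch below computes per-task accuracy under each prompting condition and the resulting delta; the record fields and helper names are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of quantifying the per-task effect of CoT: accuracy under CoT minus
# accuracy zero-shot, so a negative delta means CoT hurt performance.
from collections import defaultdict


def accuracy_by_task_and_condition(records):
    """records: iterable of {"task": str, "condition": "zero_shot" | "cot", "correct": bool}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        key = (record["task"], record["condition"])
        totals[key] += 1
        hits[key] += int(record["correct"])
    return {key: hits[key] / totals[key] for key in totals}


def cot_deltas(records):
    """Per-task change in accuracy when switching from zero-shot to CoT prompting."""
    accuracy = accuracy_by_task_and_condition(records)
    tasks = {task for task, _ in accuracy}
    return {
        task: accuracy.get((task, "cot"), 0.0) - accuracy.get((task, "zero_shot"), 0.0)
        for task in tasks
    }
```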

The study emphasizes the need to consider the specific cognitive processes of LLMs when evaluating the effectiveness of CoT prompting. The authors suggest that a deeper understanding of how humans think and make decisions could help identify situations where using CoT might hinder model performance.

This research highlights a fascinating mismatch between human and artificial cognition, and its implications extend beyond LLMs. As these models become more sophisticated, understanding their cognitive limitations, as well as their strengths, will be critical for developing and using them effectively. By considering the role of thinking in both humans and AI, we can pave the way for more robust and reliable AI systems in the future.