Chain-of-Thought (CoT) Is Mostly Helpful for Math and Symbolic Reasoning
A new meta-analysis covering more than 100 papers on chain-of-thought (CoT) prompting for large language models (LLMs) finds that the popular technique works best on tasks involving math and symbolic reasoning, with much smaller gains on other types of tasks.
Chain-of-thought prompting involves asking a language model to write down its reasoning steps before giving an answer. This has been shown to significantly improve model accuracy on many tasks, particularly in the domain of mathematics.
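To make the comparison concrete, here is a minimal sketch of the two prompting styles the study contrasts, written against an OpenAI-style chat client. The model name, prompt wording, and example problem are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: direct-answer prompting vs. chain-of-thought prompting.
# Model name and prompt wording are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()

QUESTION = (
    "A baker makes 24 muffins and sells them in boxes of 4. "
    "Each box sells for $6. How much money does the baker make?"
)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in any chat model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Direct-answer prompt: the model must answer immediately.
direct = ask(f"{QUESTION}\nAnswer with just the final number.")

# Chain-of-thought prompt: the model writes out intermediate steps first.
cot = ask(f"{QUESTION}\nLet's think step by step, then state the final answer.")

print("Direct:", direct)
print("CoT:", cot)
```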
However, a new study from the University of Texas at Austin and collaborators, posted on arXiv, finds that CoT is not a cure-all for reasoning problems. The researchers conducted a meta-analysis of existing studies on CoT and ran their own evaluations on 20 datasets using 14 LLMs. Their results show that CoT consistently boosts performance on tasks requiring mathematical or logical reasoning. For example, on the GSM8K math dataset, CoT prompts produced gains of up to 66.9% over direct-answer prompts.
In contrast, CoT provided much smaller improvements on other types of reasoning tasks, such as those involving common sense reasoning or knowledge retrieval.
“Our evidence does not support the conjecture that CoT will outperform direct answering on nearly all reasoning problems, whether those problems involve symbolic or non-symbolic reasoning,” says Zayne Sprague, a graduate student at UT Austin and lead author of the paper.
The researchers identified a strong predictor of when CoT helps: the presence of an equals sign (“=”) in the question or in the model’s response. This suggests that CoT primarily benefits problems that can be broken down into two stages (illustrated in the sketch after this list):

- Planning: parsing the problem into symbolic equations or formal specifications.
- Execution: performing symbolic manipulations to solve those equations.
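As a toy illustration of this two-stage split, consider the following sketch using SymPy. The stage names mirror the paper’s framing, but the problem and code are our own, not taken from the paper.

```python
# Toy illustration of the planning/execution split on a word problem:
# "Alice has 3 times as many apples as Bob. Together they have 24.
#  How many apples does Bob have?"
from sympy import symbols, Eq, solve

# --- Planning: translate the prose into symbolic equations. ---
a, b = symbols("a b")
equations = [Eq(a, 3 * b), Eq(a + b, 24)]

# --- Execution: manipulate the symbols to solve the equations. ---
solution = solve(equations, (a, b))
print(solution[b])  # -> 6
```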
When CoT is helpful, LLMs are typically better at the execution stage, but still fall short of what models using tool augmentation can achieve. For example, on the GSM8K dataset, prompting an LLM to generate a Python program to solve a problem and then executing that program with a Python interpreter significantly outperforms prompting the LLM to solve the problem step by step using CoT.
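A minimal sketch of this program-aided pattern might look like the following; the client, model name, and prompt wording are assumptions, and a real system would sandbox the generated code before running it.

```python
# Sketch of the tool-augmented baseline: ask the model for a Python program
# instead of a prose solution, then run that program. Client, model name,
# and prompt wording are assumptions, not the paper's exact setup.
import subprocess, sys, tempfile

from openai import OpenAI

client = OpenAI()

QUESTION = (
    "A baker makes 24 muffins and sells them in boxes of 4. "
    "Each box sells for $6. How much money does the baker make?"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model
    messages=[{
        "role": "user",
        "content": f"Write a Python program that prints the answer to:\n"
                   f"{QUESTION}\nReply with only the code, no markdown fences.",
    }],
)
program = resp.choices[0].message.content

# Execute the generated program in a subprocess. In practice you would
# sandbox this; running model-generated code directly is unsafe.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(program)
    path = f.name
result = subprocess.run([sys.executable, path], capture_output=True, text=True)
print(result.stdout)
```

Offloading the execution stage to an interpreter removes the model’s arithmetic and symbol-manipulation errors, which is exactly where CoT alone tends to fall short.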
“Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. They also highlight the need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across a wider range of LLM applications,” concludes Sprague.