Not All LLM Reasoners Are Created Equal
Large language models (LLMs) have achieved impressive results on standardized benchmarks for math reasoning, leading many to believe they have “mastered” grade-school math. However, a new paper published by researchers at Mila, Google DeepMind, and Microsoft Research shows that this belief may be premature. The researchers found that LLMs exhibit significant reasoning gaps when tasked with solving pairs of grade-school math problems where the answer to the second problem depends on correctly solving the first.
The researchers evaluated several LLMs on a new benchmark called Compositional GSM, built from the popular GSM8K benchmark. Compositional GSM chains two grade-school math problems: the model must solve the first question (Q1) and substitute its answer as a variable (X) into the second question (Q2). Answering Q2 correctly therefore requires both solving Q1 and carrying its answer through the second problem.
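The construction can be pictured with a small sketch. The code below is illustrative only: the prompt template and function name are our assumptions, not the paper's actual data pipeline.

```python
# Illustrative sketch of chaining two GSM8K-style questions into one
# Compositional GSM prompt. The template and function name are hypothetical,
# not taken from the paper's code.

def build_compositional_prompt(q1: str, q2_with_x: str) -> str:
    """Q2 refers to the (unknown) answer of Q1 through the placeholder X."""
    return (
        f"Question 1: {q1}\n"
        f"Question 2: {q2_with_x}\n"
        "Let X be the answer to Question 1. "
        "Solve Question 1 first, then use its answer as X to solve Question 2. "
        "Report the final answer to Question 2."
    )

q1 = ("There are 15 trees in the grove. Grove workers will plant trees in the "
      "grove today. After they are done, there will be 21 trees. "
      "How many trees did the grove workers plant today?")
q2 = ("There are X students in Marissa's class. Each student started the year "
      "with 10 pencils. After two months, 1/5 of the total pencils in class "
      "were used. At the end of the year, only 1/3 of the remaining pencils "
      "were left. How many pencils were left?")

print(build_compositional_prompt(q1, q2))
```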
The study found that most LLMs exhibited a significant reasoning gap, performing worse on Compositional GSM than they did on the original GSM8K benchmark. This gap was particularly pronounced in smaller, more cost-efficient, and math-specialized models. The researchers also found that instruction-tuning, a technique used to improve the reasoning capabilities of LLMs, had varying effects across different LLM sizes, and that finetuning on GSM8K could lead to overfitting.
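The gap described above can be thought of, in simplified form, as the drop in accuracy when moving from standalone GSM8K questions to chained pairs. The sketch below uses that simplified formulation with made-up numbers; the paper defines its own gap metric, which may differ in detail.

```python
# Sketch of a simple "reasoning gap" measure, based on the description above.
# This is a simplified formulation for illustration, not the paper's exact metric.

def reasoning_gap(acc_standalone: float, acc_compositional: float) -> float:
    """Drop in accuracy when problems are chained; positive = worse on pairs."""
    return acc_standalone - acc_compositional

# Hypothetical accuracies, purely for illustration.
print(round(reasoning_gap(0.92, 0.71), 2))  # 0.21 -> a 21-point gap
```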
To illustrate the reasoning gap, consider the following example:
- Q1: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
- Q2: There are X students in Marissa’s class. Each student started the year with 10 pencils. After two months, 1/5 of the total pencils in class were used. At the end of the year, only 1/3 of the remaining pencils were left. How many pencils were left?
Here, X in Q2 stands for the answer to Q1. A model that can solve each question on its own may still fail when they are chained: an error on Q1 propagates directly into Q2, so even a model that fully understands the second question will arrive at the wrong answer.
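For concreteness, the chained arithmetic for the pair above works out as follows (this worked solution is ours, for illustration; the paper reports only aggregate accuracies):

```python
# Worked arithmetic for the example pair above (our illustration).
# Q2 can only be solved once Q1's answer is known and carried forward.

# Q1: trees planted = final count - initial count
q1_answer = 21 - 15                      # 6 trees

# Q2 with X = Q1's answer
total_pencils = q1_answer * 10           # 6 students * 10 pencils = 60
after_two_months = total_pencils * 4/5   # 1/5 used -> 48 remain
pencils_left = after_two_months * 1/3    # only 1/3 of those remain -> 16

print(q1_answer, pencils_left)           # 6 16.0
```

An error anywhere in the first step propagates: a model that answers Q1 with 5, for example, ends up computing 40/3 ≈ 13.3 pencils and misses Q2 entirely.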
The researchers also found that even when models correctly solved the first question, they often failed to solve the second. This suggests that LLMs may rely on superficial pattern matching rather than genuine understanding, and that they may be overfit to the specific formats of standard benchmark problems.
The study’s findings highlight the importance of evaluating LLM reasoning capabilities beyond standard benchmarks. The researchers argue that Compositional GSM provides a valuable tool for assessing the reasoning abilities of LLMs, and that further research is needed to understand why LLMs struggle with multi-hop reasoning problems.
Overall, the paper suggests that we should not assume that LLMs have “mastered” grade-school math. There are still significant gaps in their reasoning capabilities, particularly when it comes to multi-hop problems that require the integration of multiple steps. Future research should focus on developing new techniques to improve LLM reasoning abilities and on creating more robust benchmarks to evaluate these capabilities.