AI Papers Reader

Personalized digests of the latest AI research


Reinforcement Learning for LLMs: Is it Memorization or True Reasoning?

A recent study has uncovered a critical issue with how large language models (LLMs), particularly the Qwen2.5 model family, are evaluated on mathematical reasoning tasks. The research suggests that the impressive performance gains reported for these models, often achieved through reinforcement learning (RL), may be inflated by data contamination in commonly used benchmarks. In other words, the models may have already “seen” and memorized the benchmark problems and their solutions during their extensive pre-training, leading to an overestimation of their actual reasoning abilities.

The paper, “Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination,” highlights that while Qwen2.5 models exhibit strong mathematical skills, their performance on benchmarks like MATH-500 might be due to memorization rather than genuine understanding. For example, when presented with only a portion of a MATH-500 problem, the Qwen2.5 model could accurately reconstruct the missing parts and provide the correct answer. This ability to “cheat” by recalling memorized data is a significant concern for reliable evaluation.
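To make the idea concrete, here is a minimal sketch of how such a partial-prompt completion probe might look. The `completion_probe` function, the 60% truncation ratio, and the substring-based match check are illustrative assumptions rather than the paper’s exact protocol, and the `generate` callable stands in for any LLM inference interface.

```python
# Hypothetical partial-prompt memorization probe (not the paper's exact setup).
from typing import Callable

def completion_probe(problem: str, generate: Callable[[str], str],
                     keep_ratio: float = 0.6) -> dict:
    """Show the model only a prefix of a benchmark problem and check whether
    its continuation reproduces the withheld remainder."""
    cut = int(len(problem) * keep_ratio)
    prefix, hidden = problem[:cut], problem[cut:]
    continuation = generate(prefix)
    # Verbatim reproduction of the hidden text is strong evidence of memorization.
    return {
        "prefix": prefix,
        "hidden": hidden,
        "continuation": continuation,
        "reproduced": hidden.strip() in continuation,
    }

if __name__ == "__main__":
    # Stub "model" that has the full problem memorized, for demonstration only.
    memorized = "Compute the remainder when 2^10 + 3^5 is divided by 7."
    stub_llm = lambda prefix: memorized[len(prefix):]
    print(completion_probe(memorized, stub_llm))
```

On a genuinely unseen problem, the model’s continuation should diverge from the withheld text almost immediately, which is what makes this a useful contamination check.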

To address this, the researchers introduced a novel benchmark called RandomCalculation. This dataset consists of entirely synthetic arithmetic problems generated algorithmically, guaranteeing that they could not have appeared in the models’ pre-training data. When Qwen2.5 models were tested on this “clean” benchmark, their performance on complex, multi-step calculations revealed substantial room for improvement.
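The construction can be sketched in a few lines. The generator below is a hypothetical illustration in the spirit of RandomCalculation, not the paper’s actual pipeline; the operand range, operator set, and step count are arbitrary choices.

```python
# Illustrative generator of fresh multi-step arithmetic problems
# (an assumption-laden sketch, not the paper's actual RandomCalculation code).
import random

def random_calculation(num_steps: int = 4, seed: int | None = None) -> tuple[str, int]:
    """Return a nested arithmetic expression and its ground-truth value."""
    rng = random.Random(seed)
    expr = str(rng.randint(1, 100))
    for _ in range(num_steps):
        op = rng.choice(["+", "-", "*"])
        expr = f"({expr} {op} {rng.randint(1, 100)})"
    # eval is acceptable here: the string contains only digits, parentheses,
    # and the three operators generated above.
    return expr, eval(expr)

if __name__ == "__main__":
    problem, answer = random_calculation(num_steps=5, seed=7)
    print(f"Compute: {problem}   (ground truth: {answer})")
```

Because each instance is sampled at generation time, no fixed problem set can leak into pre-training corpora, which is what makes such a benchmark contamination-free by construction.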

Crucially, the study also investigated the impact of different reward signals in RL. On the RandomCalculation dataset, only accurate, well-aligned reward signals led to consistent performance improvements. In contrast, random or incorrect rewards, which had previously produced seemingly impressive gains for Qwen2.5 on contaminated benchmarks, led to unstable training and no reliable improvement. This strongly suggests that those earlier “breakthroughs” with noisy rewards were largely a byproduct of data contamination: the models were simply retrieving memorized answers rather than learning to reason.
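A hedged sketch of the two reward regimes makes the contrast concrete. The function names, the answer-extraction regex, and the 0/1 reward scale below are illustrative assumptions, not the paper’s implementation.

```python
# Contrast between a ground-truth-aligned reward and a random reward signal.
# Names, the regex, and the 0/1 reward scale are illustrative assumptions.
import random
import re

def accurate_reward(response: str, ground_truth: float) -> float:
    """1.0 only if the final number in the response matches the ground truth."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*$", response.strip())
    if match is None:
        return 0.0
    return 1.0 if abs(float(match.group(1)) - ground_truth) < 1e-6 else 0.0

def random_reward(response: str, ground_truth: float) -> float:
    """A coin-flip reward, independent of correctness (the 'noisy' condition)."""
    return float(random.random() < 0.5)

if __name__ == "__main__":
    print(accurate_reward("... so the result is 42", 42.0))  # 1.0
    print(accurate_reward("... so the result is 41", 42.0))  # 0.0
    print(random_reward("anything at all", 42.0))            # 0.0 or 1.0
```

On a contamination-free dataset, only the first reward can steer training toward correct multi-step calculation; the second carries no information about correctness, which is consistent with the unstable training the authors observe under noisy rewards.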

The findings underscore the importance of uncontaminated datasets and rigorous evaluation methods for accurately assessing the reasoning capabilities of LLMs. The paper recommends that future work prioritize clean benchmarks and test across a variety of model families, so that reported improvements genuinely reflect enhanced reasoning rather than sophisticated memorization.