DeltaBench: New Dataset Exposes Limitations in LLM's Error Detection Abilities
New benchmark reveals that even advanced language models struggle to identify errors in their reasoning chains.
[City, State] – [Date] A new dataset called DeltaBench has been released to rigorously test the ability of Large Language Models (LLMs) to detect errors in their own “Chain-of-Thought” (CoT) reasoning. The paper associated with the dataset is currently available on arXiv. LLMs use CoT to break down complex problems into a series of smaller, more manageable steps. This process enhances their reasoning capabilities, but the quality of these reasoning chains, and the models’ ability to self-critique them, remains largely unexplored.
The researchers behind DeltaBench, affiliated with Alibaba Group and the Chinese Academy of Sciences (CASIA), found that even cutting-edge LLMs, like GPT-4, struggle to identify errors in long CoT reasoning processes.
“We found that existing LLMs have a limited ability to effectively identify errors in long Chain-of-Thought reasoning,” said Yancheng He, one of the paper’s lead authors from Alibaba Group. “For example, even GPT-4-turbo-128k achieves an F1-score of only 40.8% on our benchmark.”
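For context on the reported F1-score: error detection on a benchmark like this can be framed as binary classification over reasoning sections, with F1 the harmonic mean of precision and recall. A minimal sketch follows; the section indices and numbers are invented for illustration, not taken from DeltaBench.

```python
# Illustrative sketch: F1-score for error-section detection.
# The section indices below are made up for demonstration; they are
# not drawn from the DeltaBench dataset itself.

def f1_score(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over flagged sections."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A critic model flags sections 2, 5, and 7 as erroneous;
# human annotators marked sections 2, 7, and 9.
predicted_errors = {2, 5, 7}
gold_errors = {2, 7, 9}
print(round(f1_score(predicted_errors, gold_errors), 3))  # 0.667
```

With two of three flagged sections correct and two of three true errors found, both precision and recall are 2/3, so F1 is also 2/3 — illustrating how a model can catch some errors yet still score poorly.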
DeltaBench includes a diverse set of 1,236 problems spanning various domains, including mathematics, programming, physics, chemistry, biology, and general reasoning. For each problem, the dataset contains CoT reasoning chains generated by different LLMs. Researchers then annotated each step of these chains, noting reasoning usefulness, correctness, and reflection efficiency.
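The per-section annotation scheme described above might be represented roughly as follows. The field names here are hypothetical, chosen to mirror the labels the article mentions (usefulness, correctness, reflection); consult the released dataset for its actual schema.

```python
# Hypothetical sketch of a per-section annotation record, mirroring the
# labels described in the article. Field names are illustrative only.

from dataclasses import dataclass

@dataclass
class SectionAnnotation:
    section_id: int
    text: str
    is_useful: bool       # does the section advance the reasoning?
    is_correct: bool      # is the section free of errors?
    is_reflection: bool   # does the section revisit earlier reasoning?
    reflection_effective: bool = False  # if a reflection, did it fix anything?

annotations = [
    SectionAnnotation(0, "Import NumPy and set up the input array.",
                      True, True, False),
    SectionAnnotation(1, "Wait, the array shape is wrong; let me redo it.",
                      True, True, True, reflection_effective=True),
]

# Aggregate statistics like those reported in the paper can then be
# computed directly over the annotations.
error_rate = sum(not a.is_correct for a in annotations) / len(annotations)
print(error_rate)  # 0.0
```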
Concrete Examples Illustrate the Challenges
The paper highlights several key challenges faced by LLMs in detecting errors:
- Fundamental Errors Persist: Even in state-of-the-art models, basic errors like calculation mistakes, syntax errors, and formatting issues remain common. For instance, in one example involving a programming task, a model might correctly identify the need to import the NumPy library but then make a syntax error when creating the array. This type of error is surprisingly common, occurring between 23 and 25 percent of the time for the models examined.
- Ineffective Reflection: The proportion of effective reflection, where the model revisits and corrects its reasoning, is low. The authors note that a majority (67.8%) of the “reflections” observed in the collected long CoT responses are ultimately useless.
- Redundant Reasoning: Existing models often exhibit redundant reasoning, repeating steps or providing unnecessary information. In fact, approximately 27% of reasoning sections in the collected long CoT responses are redundant.
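The “fundamental errors” category above can be made concrete with a small, hypothetical reconstruction of the NumPy scenario; the exact snippet is not reproduced in the paper.

```python
# Hypothetical reconstruction of the kind of basic error described above:
# the model correctly imports NumPy but then botches the array creation.

import numpy as np

# Flawed step a model might produce (kept commented out because it
# raises a TypeError: np.array treats the second positional argument
# as a dtype, and 2 is not a valid dtype):
# arr = np.array(1, 2, 3)

# Corrected step: pass the elements as a single list.
arr = np.array([1, 2, 3])
print(arr.sum())  # 6
```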
The dataset also allowed researchers to evaluate the performance of Process Reward Models (PRMs), which are designed to assess the quality of each step in a reasoning chain. The findings suggest that current PRMs struggle to identify errors effectively. The researchers also found that theoretically more powerful models, such as those in the Qwen series, often performed worse than non-o1-like models.
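As a rough sketch of how per-step PRM scores can be turned into an error-localization prediction, one simple strategy is to flag the first step whose score falls below a threshold. The scores and threshold below are invented for illustration; this is not the paper's evaluation procedure.

```python
# Minimal sketch: localizing an error from per-step process reward
# model (PRM) scores by flagging the first step below a threshold.
# Scores and threshold are invented for illustration.

def first_flagged_step(step_scores, threshold=0.5):
    """Return the index of the first step scoring below `threshold`,
    or None if every step passes."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None

scores = [0.92, 0.88, 0.31, 0.75]  # hypothetical PRM outputs per step
print(first_flagged_step(scores))  # 2
```

A benchmark like DeltaBench can then compare such predicted indices against human-annotated error locations.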
Implications and Future Directions
The DeltaBench dataset provides a valuable resource for the research community to better understand the strengths and weaknesses of LLMs in reasoning and self-critique. By identifying the limitations of existing models, developers can focus on improving their error detection capabilities.
“We hope DeltaBench can guide developers to better understand the long Chain-of-Thought reasoning abilities of their models and ultimately build more reliable and trustworthy AI systems,” He concluded.
The researchers are open-sourcing the dataset, hoping to spur further innovation in this critical area of LLM development. DeltaBench is available at https://github.com/OpenStellarTeam/DeltaBench.