New Evaluation Framework "REST" Uncovers Hidden Weaknesses in Large Language Models
New York, NY – July 16, 2025 – A groundbreaking study published today introduces a novel evaluation framework, dubbed REST (Reasoning Evaluation through Simultaneous Testing), that challenges the current understanding of the capabilities of large reasoning models (LRMs). Developed by researchers at Tsinghua University and OpenDataLab, REST reveals significant performance degradation in even the most advanced LRMs when they face multiple complex problems presented concurrently, a scenario more reflective of real-world applications.
For years, the dominant approach to evaluating LRMs has been to present them with one question at a time. While this method has proven effective for early model development, the paper argues that it has become insufficient due to the rapid advancements in LRMs. The primary limitations identified are:
- Vulnerability to Data Contamination and Saturation: Existing benchmarks are becoming “too easy” for state-of-the-art models, leading to saturated performance scores. This forces the constant, costly creation of new, more difficult datasets as existing ones fall into what the researchers call “benchmark obsolescence.” For example, a model achieving 97% accuracy on a saturated benchmark might still have significant room for improvement in more demanding situations.
- Failure to Assess Multi-Contextual Pressure: Real-world applications, such as AI tutors or technical support systems, often require models to handle multiple, potentially interfering questions simultaneously. The single-question approach fails to capture a model’s ability to manage this “multi-context pressure,” including prioritizing information, resisting interference between problems, and dynamically adjusting cognitive load.
REST tackles these limitations by transforming existing benchmarks into more challenging variants. It achieves this by concatenating multiple questions into a single prompt, effectively stress-testing the LRMs’ ability to handle increased reasoning loads. The framework evaluates several under-tested capabilities, such as contextual priority allocation and cross-problem interference resistance.
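To make the core idea concrete, here is a minimal Python sketch of how several single-question benchmark items could be packed into one REST-style prompt. The function name `build_rest_prompt`, the instruction wording, and the sample questions are illustrative assumptions for this article, not the paper’s exact implementation.

```python
# Minimal sketch of REST-style prompt construction (illustrative only;
# the function name, instruction wording, and sample items are assumptions,
# not the authors' exact setup).

from typing import List


def build_rest_prompt(questions: List[str]) -> str:
    """Concatenate several benchmark questions into one stress-test prompt."""
    header = (
        "Solve each of the following problems. "
        "Answer every problem separately and label each answer clearly.\n"
    )
    body = "\n".join(
        f"Problem {i + 1}: {q.strip()}" for i, q in enumerate(questions)
    )
    return header + "\n" + body


# Example: pack three single-question benchmark items into one REST prompt.
sample_questions = [
    "Compute 17 * 24.",
    "What is the derivative of x**3 with respect to x?",
    "A train travels 180 km in 2 hours. What is its average speed in km/h?",
]
print(build_rest_prompt(sample_questions))
```

Because the underlying questions are drawn from existing benchmarks, scoring can reuse each benchmark’s original answer key; only the prompt packaging changes, which is what keeps the approach inexpensive relative to authoring new datasets.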
The study’s evaluation of 34 advanced LRMs across seven reasoning benchmarks yielded striking findings. Even top-performing models like DeepSeek-R1, which excel in single-question evaluations, exhibited a substantial performance drop – a decrease of up to 29.1% on certain challenging benchmarks like AIME24. This challenges the common assumption that LLMs are inherently “multi-problem solvers.”
Furthermore, REST demonstrated stronger discriminative power than traditional single-question evaluation. Models that performed nearly identically on single-question tests displayed significant accuracy differences when evaluated under the REST framework, highlighting REST’s effectiveness in surfacing subtle but important distinctions in model capabilities.
An analysis of the performance degradation revealed a key factor: the “overthinking trap,” in which LRMs generate overly verbose reasoning processes even for simpler problems. In contrast, models trained with a “Long2Short” technique, which encourages more concise reasoning, performed better under REST. This suggests that developing more efficient, less verbose reasoning strategies is a promising direction for future LRM development.
The researchers emphasize that REST offers a cost-effective and scalable alternative to the continuous creation of new benchmarks. By better reflecting real-world demands, REST not only provides a more accurate assessment of current LRMs but also offers crucial insights for building more robust and capable AI systems.