AI Papers Reader

Personalized digests of latest AI research


New Benchmark, RiddleBench, Exposes Catastrophic Failures in LLM Reasoning and Self-Correction

A new study introducing the reasoning benchmark RiddleBench has revealed that even the most advanced Large Language Models (LLMs) suffer from deep, systemic flaws when tackling complex logic puzzles that require integrating deduction, spatial awareness, and constraint satisfaction.

While LLMs have excelled at many established tasks like structured mathematical problem-solving, researchers found that these models fail spectacularly when confronted with multifaceted puzzles demanding the kind of integrated reasoning that is a cornerstone of human intelligence.

RiddleBench, compiled by researchers from institutions including AI4Bharat and IIT Madras, consists of 1,737 challenging puzzles sourced from competitive Indian government examinations. These problems are categorized into Sequential Reasoning, Coding-Decoding, Blood Relations, and, most critically, Seating Arrangements.

The Problem of Integrated Logic

Unlike simple arithmetic, RiddleBench demands that LLMs synthesize multiple textual constraints to construct a holistic “mental model.”

For instance, a Blood Relations puzzle (like determining kinship across three generations) requires the model to build an internal family tree or graph, constantly updating and verifying complex constraints (e.g., “Radhey is the son-in-law of Krishna,” “Kanha is the grandmother of Hari”).
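To make the idea concrete, here is a minimal sketch of the kind of internal model a Blood Relations puzzle demands: each clue becomes an edge in a small kinship graph that must be updated and queried as constraints arrive. The clues and the `FamilyGraph` helper are invented for illustration; the benchmark itself poses these puzzles purely in text.

```python
class FamilyGraph:
    """Toy kinship graph: clues become directed, labeled edges."""

    def __init__(self):
        # edges[(a, b)] holds the stated relation of a to b
        self.edges = {}

    def add_constraint(self, person, relation, relative):
        """Record a clue such as 'X is the mother of Y'."""
        self.edges[(person, relative)] = relation

    def parents_of(self, person):
        return [a for (a, b), rel in self.edges.items()
                if b == person and rel in ("father", "mother")]

    def is_grandparent(self, a, b):
        """a is a grandparent of b if a is a parent of one of b's parents."""
        return any(a in self.parents_of(p) for p in self.parents_of(b))


g = FamilyGraph()
g.add_constraint("Kanha", "mother", "Meera")  # hypothetical clue
g.add_constraint("Meera", "mother", "Hari")   # hypothetical clue
print(g.is_grandparent("Kanha", "Hari"))      # prints True
```

With just two invented clues, the graph already entails the puzzle's stated fact that Kanha is a grandparent of Hari; real benchmark items chain many more such constraints across three generations.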

The results were sobering: Top-tier models like OpenAI’s o3, Claude 4 Sonnet, and Gemini 2.5 Pro achieved overall accuracy scores barely above 60%. Performance plummeted on Seating Arrangement tasks, where LLMs must deduce spatial layouts—a challenge that exposed their difficulty in maintaining a mutable holistic model.
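As a sketch of what a Seating Arrangement task demands, the brute-force search below enumerates every ordering of a tiny invented instance and keeps those satisfying all constraints. The people and constraints are hypothetical and far simpler than the benchmark's puzzles, but the shape of the problem (jointly satisfying spatial constraints over a mutable layout) is the same.

```python
from itertools import permutations

# Invented instance: four people in a row, three constraints.
people = ["A", "B", "C", "D"]

def satisfies(order):
    pos = {p: i for i, p in enumerate(order)}
    return (
        abs(pos["A"] - pos["B"]) == 1  # A sits immediately next to B
        and pos["C"] < pos["D"]        # C sits somewhere to the left of D
        and pos["A"] != 0              # A is not at the left end
    )

solutions = [order for order in permutations(people) if satisfies(order)]
print(len(solutions))  # prints 5
```

A human solver does not enumerate all 24 orderings; they prune the space by propagating constraints, which is precisely the holistic model-building that the benchmark found LLMs struggle to maintain.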

The Illusion of Self-Correction

Moving beyond simple accuracy, the research investigated the reliability and robustness of LLM reasoning, revealing two profound failures.

First, researchers tested the “model-as-judge” paradigm, tasking one model (Qwen QwQ 32B) with validating the flawed logical traces of another (DeepSeek-R1). This exposed a “hallucination cascade”: the evaluator model accurately flagged the flawed logic only 44.1% of the time, often uncritically validating errors made by its peer. Verification proved so difficult that 55% of the evaluation attempts timed out entirely.

Second, models demonstrated a powerful self-confirmation bias. When Qwen QwQ 32B was tasked with checking its own flawed reasoning, it failed to identify its errors in nearly 68% of trials, successfully self-correcting just 17.3% of the time. This suggests that LLMs are statistically far more likely to entrench their own errors than to correct them, casting doubt on methods that rely on iterative self-refinement.

Fragile Reasoning and Red Herrings

The LLMs’ logical processes were also found to be extremely fragile.

To test robustness, researchers introduced superficial changes to the prompts that should not impact a true logical reasoner. When the order of constraint sentences was randomly shuffled (e.g., placing the definition of B before the definition of A), model performance dropped significantly, losing up to 6.7 percentage points on Blood Relations puzzles.

Similarly, introducing a single, irrelevant “red herring” sentence into the puzzle prompt caused accuracy to drop on most categories. These results imply that models rely on brittle, sequential processing heuristics rather than robustly comprehending the underlying logical structure.
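The two perturbations above are mechanically simple, which is what makes the performance drops striking. A sketch of both probes, assuming plain-text puzzles with period-separated constraint sentences (the distractor text would be invented, not drawn from the paper):

```python
import random

def split_sentences(puzzle: str) -> list[str]:
    # Assumes constraint sentences are separated by periods.
    return [s.strip() for s in puzzle.split(".") if s.strip()]

def shuffle_constraints(puzzle: str, seed: int = 0) -> str:
    """Reorder the constraint sentences (seeded for reproducibility)."""
    sentences = split_sentences(puzzle)
    random.Random(seed).shuffle(sentences)
    return ". ".join(sentences) + "."

def add_red_herring(puzzle: str, herring: str, seed: int = 0) -> str:
    """Insert one irrelevant sentence at a random position."""
    sentences = split_sentences(puzzle)
    pos = random.Random(seed).randrange(len(sentences) + 1)
    sentences.insert(pos, herring.rstrip("."))
    return ". ".join(sentences) + "."
```

Neither transformation changes the set of logical constraints, so a reasoner that builds a genuine model of the puzzle should score identically before and after.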

RiddleBench is being made publicly available, providing the research community with a diagnostic tool to guide the development of more reliable and robust AI systems capable of deep, integrated reasoning.