The Dominoes of Logic: Why Today’s Best AI Models Fail When the Rules Change

Artificial intelligence has conquered standardized exams, written convincing essays, and mastered coding. Yet, when faced with the strict, unyielding rules of formal logic, even the most advanced “thinking” models trip over their own feet.

A newly released study by researchers from Fudan University and Tencent introduces LLMEval-Logic, a rigorous Chinese-language benchmark designed to test the limits of large language models (LLMs). The findings are a sobering reality check for the AI industry: while the best models easily pass simple logic tests, they suffer a near-total collapse when forced to track complex, changing rules. Even the top-performing model, Gemini 3.1 Pro, scored only 37.5% on the benchmark’s hardest challenges.

Moving Beyond Templates

Traditionally, AI logic tests have been generated using rigid templates—effectively translating math formulas directly into robotic-sounding text. Modern LLMs have learned to spot the statistical patterns in these templates, scoring near-perfect marks without doing any genuine reasoning.

LLMEval-Logic takes a different approach. It features human-authored, realistic scenarios, such as navigating corporate schedules, decoding institutional procedures, or sorting out complex rules. Crucially, the researchers used Z3, an automated theorem prover (SMT solver), to formally verify the answer to every question, and built detailed grading rubrics to evaluate how well models translate human language into formal logic.
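To give a feel for what "formally verifying an answer" means, here is a minimal sketch. It is not the authors' pipeline and it does not use Z3: it swaps in brute-force enumeration of every truth assignment, which only works for tiny puzzles. The puzzle, the variable names, and the helper functions are all hypothetical.

```python
from itertools import product

# Hypothetical toy puzzle, not taken from the benchmark.
VARIABLES = ("alice_attends", "bob_attends", "room_booked")

def constraints(alice_attends, bob_attends, room_booked):
    """Return True when an assignment satisfies every rule of the toy puzzle."""
    return all([
        alice_attends or bob_attends,        # at least one of them attends
        (not alice_attends) or room_booked,  # if Alice attends, the room is booked
        not (bob_attends and room_booked),   # Bob never attends when the room is booked
        alice_attends,                       # stated fact: Alice attends
    ])

def models(rule, names):
    """Enumerate every truth assignment that satisfies `rule`."""
    found = []
    for values in product([False, True], repeat=len(names)):
        if rule(*values):
            found.append(dict(zip(names, values)))
    return found

def has_unique_answer(rule, names):
    """A question is well-posed here when exactly one assignment survives."""
    return len(models(rule, names)) == 1
```

A solver such as Z3 answers the same question symbolically instead of by enumeration, which is what makes it practical for larger rule sets.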

The Counterfactual Trap

To understand why this benchmark is so devastating to AI, consider a scenario from the study involving a speaker deciding whether to attend a class reunion. The decision depends on a web of interlocking rules:

  • If “Xiaomai” is definitely going, the speaker goes regardless of meetings.
  • If the venue is close and there are no meetings, the speaker goes.
  • If a meeting is scheduled, it blocks certain other rules from triggering.

In a baseline test, an AI might correctly determine the speaker’s choices. However, LLMEval-Logic’s “Hard” subset uses an automated, adversarial pipeline to twist these scenarios using “counterfactual” updates. For instance: “Suppose the afternoon client meeting is cancelled, and there is no other important meeting. Now, what are the possible outcomes?”

To answer correctly, a model cannot simply swap out one fact. It must perform a global recomputation. It has to reopen every single branch of the logic tree, re-evaluating how the cancellation cascades through the other rules—such as how it affects Xiaomai’s status and old-friend aliases.

Instead, LLMs fall into a “local patching” trap. Like a human trying to fix a complex spreadsheet by changing a single number without updating the formulas, the models patch the cancelled meeting locally but keep outdated conclusions that were only true in the original scenario.
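The difference between global recomputation and local patching can be sketched in a few lines. The encoding below is a simplified, hypothetical version of the reunion rules (the benchmark's actual scenario has more rules), and the "stay home by default" fallback is an assumption added for illustration.

```python
from itertools import product

def attends(meeting, xiaomai_going, venue_close):
    """Simplified, hypothetical encoding of the reunion rules."""
    if xiaomai_going:                  # rule 1: holds regardless of meetings
        return True
    if venue_close and not meeting:    # rule 2: blocked when a meeting is scheduled
        return True
    return False                       # assumed default: the speaker stays home

def attending_worlds(meeting):
    """Global recomputation: re-evaluate every branch under the given facts.

    Each world is a pair (xiaomai_going, venue_close).
    """
    return {
        (x, v)
        for x, v in product([False, True], repeat=2)
        if attends(meeting, x, v)
    }

# Original scenario: a meeting is scheduled.
original = attending_worlds(meeting=True)

# Counterfactual update: the meeting is cancelled, so every branch is reopened.
recomputed = attending_worlds(meeting=False)

# Local patch: the fact is flipped, but the old conclusion is reused as-is.
locally_patched = original

# Worlds where the speaker now attends but the patched answer says otherwise.
missed = recomputed - locally_patched
```

In this toy version, `missed` contains the world where Xiaomai is not going but the venue is close: the cancelled meeting unblocks rule 2, and a conclusion that was only valid in the original scenario hides it.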

A Startling Ranking Inversion

The study evaluated 14 frontier LLMs, including GPT-5.4 Pro and Claude Opus 4.6, revealing a dramatic gap between simple and adversarial reasoning.

On the single-question “Base” tests, models like Seed 2.0 Pro and Hy3 preview sat comfortably at the top, scoring over 75%. Yet, on the “Hard” multi-step tests, their performance plummeted to just 20.4% and 21.6%, respectively. Meanwhile, Claude Opus 4.6, which ranked lowest on the simple tests, rose to second place on the hard tests.

This ranking flip indicates that high accuracy on single-shot questions does not translate to reliable, sustained reasoning. When AI is deployed in high-stakes environments, such as reviewing legal contracts, checking medical guidelines, or auditing software code, a single missed logical cascade can have disastrous real-world consequences. LLMEval-Logic suggests that before we can trust AI with these critical tasks, we must teach it how to rebuild its logical world when a single domino falls.