The Sycophancy Trap: Why AI "Judges" are Failing University-Level Math
In the world of artificial intelligence, high-school math is becoming a solved problem. Frontier models can now routinely snag gold medals on Olympiad-style exams. However, a new research paper from a large, multi-institution collaboration led by Yale University suggests that when the difficulty ramps up to university-level proofs, we encounter a startling “Alignment Gap.” Not only do models struggle to solve these complex problems, but they also prove remarkably unreliable at grading them.
The researchers introduced QEDBENCH, a rigorous new benchmark designed to audit the reliability of “AI judges.” While previous studies focused on whether an AI could get the right answer, QEDBENCH looks at the reasoning within full-text proofs across ten disciplines, including Abstract Algebra, Real Analysis, and Graph Theory.
To build this benchmark, 48 PhD-level mathematicians spent over 1,000 hours hand-grading 1,300 proofs generated by state-of-the-art models. When they compared their expert grades to the scores given by AI judges (like GPT-5 Pro and Claude 4.5), they discovered what they call the “Sycophancy Trap.”
Intuition: The “Pretty Proof” Problem
To understand the Sycophancy Trap, imagine a student who turns in a physics paper that is perfectly formatted in LaTeX, uses sophisticated vocabulary, and cites famous theorems, but contains a calculation error on page one that makes the rest of the paper nonsense. A human professor would spot the error immediately and fail the paper. An AI judge, however, is often blinded by the “authoritative” style.
In one striking example from the paper, a solver model experienced a technical timeout and produced a literal error message: “Error: No text content found in response.” Astonishingly, the AI judge (GPT-5.2 Pro) awarded this error message partial credit (0.5/1.0) simply because it was formatted correctly.
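A trivial safeguard, which the paper’s anecdote suggests was missing, is to screen out degenerate outputs before they ever reach a judge. Here is a minimal sketch; the error string is quoted from the paper, but the function and threshold are hypothetical illustrations, not part of QEDBENCH:

```python
# Reject degenerate solver outputs before grading, so an error
# message can never earn partial credit for its formatting.

DEGENERATE_MARKERS = (
    "Error: No text content found in response.",  # literal string from the paper
)

def is_gradable(proof_text: str, min_length: int = 50) -> bool:
    """Heuristic pre-filter: an empty, too-short, or known error
    string should score 0 without ever being judged."""
    text = proof_text.strip()
    if len(text) < min_length:
        return False
    return not any(marker in text for marker in DEGENERATE_MARKERS)

print(is_gradable("Error: No text content found in response."))  # False
print(is_gradable("Let G be a connected planar graph. " * 5))    # True
```

Filters like this are cheap, but they only catch the most blatant failures; the deeper problem the paper documents is that even substantive-looking proofs fool the judges.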
This highlights the “Alignment Gap”: AI judges tend to reward persuasive language and surface-level coherence over actual logical soundness.
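One way to make the Alignment Gap concrete is to compare a judge’s scores against expert grades on the same proofs. The sketch below is purely illustrative; the function name and the scores are invented for demonstration, not drawn from QEDBENCH’s data:

```python
# Quantify an "alignment gap": the mean amount by which an AI judge's
# scores drift above expert human grades on the same set of proofs.
# (Illustrative numbers; not actual QEDBENCH scores.)

def alignment_gap(judge_scores, expert_scores):
    """Mean signed difference (judge - expert) on a 0.0-1.0 scale.
    A positive gap means the judge systematically over-grades."""
    assert len(judge_scores) == len(expert_scores)
    diffs = [j - e for j, e in zip(judge_scores, expert_scores)]
    return sum(diffs) / len(diffs)

# Hypothetical grades for five proofs (0.0 = fail, 1.0 = perfect).
judge  = [0.9, 0.5, 0.8, 1.0, 0.7]
expert = [0.2, 0.0, 0.8, 0.9, 0.3]

print(f"Alignment gap: {alignment_gap(judge, expert):+.2f}")
```

A signed mean is used deliberately: sycophancy is a directional bias (over-grading), so errors in the two directions should not cancel out as they would in an absolute-error metric.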
The Construction Gap
The research also uncovered a “Discrete-Continuous Divide.” Models are surprisingly good at “continuous” math like Differential Equations, which often follow a standard “recipe” or template. But they collapse in “discrete” domains like Combinatorics, which require constructing novel mathematical objects from scratch.
For example, in a Graph Theory problem regarding “plane duals,” several models produced well-structured proofs but completely forgot to assume the graph was “connected”—a fundamental requirement for the theorem to hold. While human experts deducted massive points for this fatal omission, AI judges like Claude Sonnet 4.5 gave the proofs a 0.9/1.0, praising the “well-structured” argument while ignoring that the core logic was fundamentally broken.
Can We Just Tell the AI to Be Stricter?
The researchers tested a “Course-Specific Rubric” to see if they could fix this by prompting the AI judges to be more demanding—for example, by telling them to penalize the use of advanced theorems that a student wouldn’t yet know.
The results were discouraging, revealing a phenomenon the authors call “Rubric Insensitivity.” The models’ internal biases were so strong that they largely ignored the negative constraints in the prompt. Even when explicitly told to be strict, “grade inflators” like Llama 4 Maverick continued to pass nearly 90% of solutions, compared to a human expert pass rate of just 67%.
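Rubric Insensitivity can be checked directly: if a judge genuinely applies a stricter rubric, its pass rate should fall toward the expert baseline. The verdicts below are hypothetical stand-ins chosen to mirror the reported 90%-vs-67% gap, not actual QEDBENCH data:

```python
# Measure "Rubric Insensitivity": compare a judge's pass rate under a
# default prompt vs. an explicitly strict rubric, against expert grades.
# (Verdicts are illustrative, not actual QEDBENCH data.)

def pass_rate(verdicts):
    """Fraction of proofs marked as passing (True)."""
    return sum(verdicts) / len(verdicts)

# Hypothetical pass/fail verdicts on ten proofs (True = pass).
default_rubric = [True] * 9 + [False]          # ~90% pass
strict_rubric  = [True] * 9 + [False]          # unchanged: rubric ignored
expert_grades  = [True] * 7 + [False] * 3      # ~70% expert pass rate

shift = pass_rate(strict_rubric) - pass_rate(default_rubric)
gap   = pass_rate(strict_rubric) - pass_rate(expert_grades)
print(f"Judge shift under strict rubric: {shift:+.2f}")  # near zero = insensitive
print(f"Remaining gap to experts:        {gap:+.2f}")
```

A near-zero shift combined with a large remaining gap to the experts is exactly the signature the paper describes: the judge’s leniency survives even an explicit instruction to be strict.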
Why It Matters
As we move toward training AI models using reinforcement learning, we rely on these same AI judges to provide feedback. If the “judge” is a lenient sycophant that rewards hallucinated rigor and pretty formatting, we risk training a generation of models that are experts at sounding right while being logically hollow. The QEDBENCH findings suggest that until AI can perform global logical dependency tracking, the human mathematician’s red pen remains irreplaceable.