Beyond the Answer Key: ROSE Introduces a "Prover-Refuter" System to Fix How We Grade AI
For years, the gold standard for testing AI models that translate natural language into database code (NL2SQL) has been a simple pass-fail test called Execution Accuracy (EX). If the AI’s generated code produces the exact same result as the “official” answer key, it passes. If it doesn’t, it fails.
However, new research from the Hong Kong University of Science and Technology and the National University of Singapore suggests that this grading system is broken. In a recently published paper, researchers reveal that as AI models become more sophisticated, our methods for evaluating them are actually becoming less reliable. To solve this, they have introduced ROSE (ReasOning ScorE), an intent-centered metric that moves beyond rote matching to judge whether an AI actually understands what a user wants.
The Problem with the “Answer Key”
The researchers argue that the traditional EX metric suffers from three major flaws: stylistic variance, question ambiguity, and “dirty” data.
To build an intuition for this, imagine asking an AI, “Who are the top-selling employees this year?” One AI might return a list of names and total sales, while another might return names and the number of transactions. If the human who created the benchmark only included the “total sales” version in the answer key, the second AI gets a zero—even though its answer is perfectly logical.
Furthermore, many popular benchmarks are riddled with errors. The researchers found that in some major datasets, up to 25% of the “correct” answers were flagged as wrong by human experts. Under the old system, an AI that provides a truly correct answer is penalized for not mimicking the human’s mistake.
Enter the Prover-Refuter Cascade
ROSE abandons the idea of a single “correct” string of code. Instead, it employs a two-stage “adversarial” process powered by large language models (LLMs).
- The SQL Prover: This stage looks only at the user’s question and the generated code. It asks: “Based on the database structure, does this code logically satisfy the user’s intent?” It ignores the answer key entirely to avoid being biased by potential errors in the benchmark.
- The Adversarial Refuter: This stage then looks at the “official” answer key. If the AI’s code and the official code produce different results, the Refuter acts as a prosecutor. It challenges the Prover’s approval by pointing out the discrepancy. The system then decides if the AI’s version is a valid alternative or a genuine failure.
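The two-stage cascade can be sketched as control flow. This is a hedged illustration of the process described above, not the paper's implementation: `llm` stands in for any large-language-model call returning a yes/no verdict, and the prompt wording is invented for clarity.

```python
from typing import Callable

def rose_verdict(
    question: str,
    schema: str,
    predicted_sql: str,
    gold_sql: str,
    pred_result: list,
    gold_result: list,
    llm: Callable[[str], bool],  # hypothetical yes/no LLM judge
) -> bool:
    """Sketch of the Prover-Refuter cascade: accept or reject a prediction."""
    # Stage 1: the SQL Prover judges the prediction against the question
    # and schema alone -- it never sees the answer key, so benchmark
    # errors cannot bias it.
    prover_approves = llm(
        f"Schema: {schema}\nQuestion: {question}\nSQL: {predicted_sql}\n"
        "Does this SQL logically satisfy the user's intent? Answer yes or no."
    )
    if not prover_approves:
        return False  # rejected before the answer key is ever consulted

    # Stage 2: if the prediction's results already match the gold query's,
    # there is no discrepancy to prosecute.
    if pred_result == gold_result:
        return True

    # Otherwise the Adversarial Refuter challenges the Prover's approval
    # by presenting the concrete discrepancy with the official answer.
    return llm(
        f"Question: {question}\n"
        f"Predicted SQL: {predicted_sql}\nGold SQL: {gold_sql}\n"
        f"Their results differ: {pred_result} vs. {gold_result}.\n"
        "Is the predicted SQL still a valid alternative reading of the "
        "question, rather than a genuine failure? Answer yes or no."
    )
```

The key design choice this sketch captures is the ordering: intent is judged first, in isolation, and the answer key only enters as evidence for a challenge rather than as ground truth.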
In tests on a new expert-verified dataset called ROSE-VEC, this new metric outperformed existing methods by nearly 24% in its ability to agree with human experts.
A Growing “Metric Crisis”
The paper’s most striking finding is what the authors call a “metric crisis.” As AI models like GPT-5 and DeepSeek-R1 become more powerful, the gap between their “true” performance and their traditional scores is widening.
Because advanced models are more “expressive,” they often find creative or highly specific ways to answer questions that don’t match a rigid, pre-written answer key. The researchers found that for the most advanced models, the traditional EX metric might undercount their true accuracy by as much as 20%.
By shifting the focus from “matching the key” to “fulfilling the intent,” ROSE provides a more honest look at the state of AI development. It suggests that the future of AI evaluation isn’t about having a better answer key, but about having a better judge.