Cleaning Up Humanity’s Last Exam: Why the Toughest AI Test Needed a Course Correction

In the race to build “superintelligent” AI, the yardstick used to measure progress is everything. Recently, a benchmark called Humanity’s Last Exam (HLE) emerged as the gold standard—a collection of 2,500 expert-level questions designed to be so difficult that they would be the final hurdle for large language models (LLMs). But a new study from researchers at Alibaba Group and the Qwen Team reveals a startling reality: even the “final exam” can be riddled with errors.

The paper, titled HLE-Verified, argues that as AI tests become more complex, the integrity of the questions themselves often degrades. When a question is ambiguous or the answer key is flat-out wrong, it doesn’t just lower a model’s score; it distorts our entire understanding of what these models can actually do.

The Problem with “Noisy” Data

To fix this, the researchers performed a forensic audit of the original HLE. They found a “non-trivial” number of noisy items, ranging from contradictory statements to incorrect rationales. This noise acts like a fog, making it impossible to tell if an AI failed because it wasn’t “smart” enough or because the question was a riddle with no correct answer.

The team introduced a two-stage verification process. In Stage I, they identified 641 “Gold” items that were correct as originally written. In Stage II, they spent thousands of hours meticulously revising 1,170 flawed but fixable questions. The remaining 689 items were relegated to an “uncertain” set—questions so murky they require further specialist input.
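
The three subsets partition the original 2,500 questions exactly. As a quick sanity check, the snippet below simply tallies the split reported in the paper; the dictionary layout is an illustrative assumption, not the released dataset’s actual schema.

```python
# Tally of the HLE-Verified split described in the paper.
# Subset names follow the text; the dict structure is illustrative only.
subsets = {
    "gold": 641,        # Stage I: correct as originally written
    "revised": 1170,    # Stage II: flawed but fixable, rewritten by experts
    "uncertain": 689,   # still ambiguous, held out for further specialist review
}

total = sum(subsets.values())
assert total == 2500, "the three subsets should cover the full benchmark"
for name, count in subsets.items():
    print(f"{name:>9}: {count:5d} ({count / total:.1%})")
```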

Concrete Examples of Benchmark Blunders

The paper provides a few striking examples of how these errors manifest in high-level science and technology:

  • Computer Science: One question about “speculative decoding” (a technique for speeding up LLM inference) originally claimed the algorithm’s acceptance rate should be less than 1.0 because of hardware-specific differences in GPU kernels. The researchers found that this conflated “hardware noise” with the underlying mathematical theory, and revised the answer to exactly 1.0, the algorithm’s theoretical property (see the sketch after this list).
  • Biology/Medicine: A complex case study involving “oculomotor palsy” (paralysis of the nerve that controls most eye movements) had incorrectly localized the lesion to the brain’s reticular formation. Experts corrected this to the midbrain, which houses the oculomotor nuclear complex. Without this fix, a “perfect” medical AI would have been penalized for giving the anatomically correct answer.
  • Chemistry: Several questions featured molecular mass constants that were mathematically inconsistent, making the chemical systems described “unsatisfiable.”
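
To see why an acceptance rate of exactly 1.0 can be a purely mathematical property rather than a hardware-dependent one: under the standard speculative-decoding acceptance rule, a draft token x sampled from the draft distribution q is accepted with probability min(1, p(x)/q(x)), so the expected acceptance rate is the sum over tokens of min(p(x), q(x)). That sum equals exactly 1.0 whenever the draft distribution matches the target, no matter which GPU kernels are involved. The sketch below illustrates this general property with toy distributions; it is not a reconstruction of the benchmark question itself.

```python
import numpy as np

# Standard speculative-decoding acceptance rule: a draft token x ~ q is
# accepted with probability min(1, p(x)/q(x)), so the expected per-token
# acceptance rate is sum_x min(p(x), q(x)). Toy distributions for illustration.

def expected_acceptance(p: np.ndarray, q: np.ndarray) -> float:
    """Theoretical acceptance rate for target distribution p and draft q."""
    return float(np.minimum(p, q).sum())

p      = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])   # target model's next-token distribution
q_same = p.copy()                                        # draft matches the target exactly
q_off  = np.array([0.25, 0.25, 0.25, 0.125, 0.125])     # draft diverges from the target

print(expected_acceptance(p, q_same))  # 1.0 -- a property of the math, not of the GPU
print(expected_acceptance(p, q_off))   # 0.75 -- falls below 1.0 only when the draft differs
```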

A 40-Point Swing

The impact of these corrections was massive. When the researchers tested seven state-of-the-art models (including versions of GPT, Claude, and Gemini) on the revised benchmark, the results shifted dramatically.

On average, model accuracy jumped by 7 to 10 percentage points across the entire set. More tellingly, on the specific subset of questions that had been corrected, model performance soared by a staggering 30 to 40 percentage points. Essentially, the models were much smarter than the original test gave them credit for—they were simply being “gaslighted” by a broken answer key.

The researchers also found that when a question was flawed, the models tended to report lower confidence in their answers. This suggests that model confidence might actually be a useful “smoke detector” for identifying bad questions in future datasets.
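
That “smoke detector” idea is easy to prototype. The sketch below is a hypothetical illustration rather than the paper’s actual procedure: the record layout, confidence values, and the 0.3 threshold are all assumptions. It simply flags items on which every model reports low self-rated confidence as candidates for expert review.

```python
# Hypothetical use of self-reported model confidence to surface benchmark
# items that may be flawed. Records, field names, and the 0.3 cut-off are
# illustrative assumptions, not values from the paper.
records = [
    {"question_id": "q1", "confidences": [0.85, 0.78, 0.91]},
    {"question_id": "q2", "confidences": [0.22, 0.15, 0.30]},  # every model is unsure
    {"question_id": "q3", "confidences": [0.60, 0.25, 0.70]},
]

LOW_CONFIDENCE = 0.3  # arbitrary threshold for "low confidence"

flagged = [
    r["question_id"]
    for r in records
    if max(r["confidences"]) <= LOW_CONFIDENCE  # no model is confident on this item
]
print("candidates for expert review:", flagged)  # ['q2']
```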

The Bottom Line

HLE-Verified serves as a cautionary tale for the AI industry. As we move toward testing AI on the frontiers of human knowledge, the bottleneck may no longer be the intelligence of the machine, but the accuracy of the human-provided “truth” used to judge it. By releasing this verified dataset, the researchers have provided a more transparent, reliable map for the road to AGI.