New Benchmark InnoGym Challenges AI Agents to Be Truly Innovative, Not Just Correct
In a significant move to push the boundaries of artificial intelligence, researchers have introduced InnoGym, a pioneering benchmark designed to check not just whether AI agents find the right answer, but whether they arrive at that answer through genuine innovation and original methodology.
While Large Language Models (LLMs) and AI agents have demonstrated impressive capabilities in areas like code generation and scientific discovery, current benchmarks overwhelmingly reward mere correctness. The creators of InnoGym argue that this narrow focus misses a fundamental aspect of intelligence: the ability to generate novel, effective solutions.
“Intelligence and innovation lie not only in what is achieved, but in how,” the researchers write. Two agents might produce the same accurate result, but one may have relied on a standard textbook method, while the other developed a radically new approach.
Measuring Breakthroughs and Novelty
To quantify this, InnoGym introduces a comprehensive evaluation framework using two complementary metrics:
- Performance Gain (G): This measures how much a new solution improves upon the best known baseline, i.e., the current state of the art ($V_{known}$). A positive score signifies a super-human breakthrough.
- Novelty (N): This quantifies the methodological dissimilarity between a new solution and all prior known approaches. Novelty is measured using an “Agent-as-judge” system, where a specialized LLM extracts the core strategy of a solution (e.g., data pipeline, model architecture, optimization techniques) and compares it against baseline strategies to assign a dissimilarity score.
This dual-metric approach means true innovation is only credited when both the performance gain is high and the methodology is distinct.
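As a rough illustration of how these two axes could interact, the sketch below combines a relative performance gain with a judge-based novelty score. The exact formulas, the `judge` callable standing in for the Agent-as-judge LLM, and the thresholds are assumptions made for exposition, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Solution:
    score: float   # task metric achieved by this solution (higher is better here)
    strategy: str  # natural-language summary of the core method

def performance_gain(v_new: float, v_known: float) -> float:
    """Relative improvement over the best known baseline V_known.

    Positive values mean the new solution beats the state of the art.
    (Illustrative formulation; the paper may normalize differently.)
    """
    return (v_new - v_known) / abs(v_known)

def novelty(candidate: Solution,
            baselines: Sequence[Solution],
            judge: Callable[[str, str], float]) -> float:
    """Methodological dissimilarity from all prior known approaches.

    `judge(strategy_a, strategy_b)` stands in for the Agent-as-judge LLM and
    returns a dissimilarity score in [0, 1]. Taking the minimum over baselines
    credits novelty only if the method differs from *every* known approach
    (an assumed aggregation rule).
    """
    return min(judge(candidate.strategy, b.strategy) for b in baselines)

def is_innovative(candidate: Solution,
                  baselines: Sequence[Solution],
                  v_known: float,
                  judge: Callable[[str, str], float],
                  gain_threshold: float = 0.0,
                  novelty_threshold: float = 0.5) -> bool:
    """Dual-metric gate: credit innovation only when gain AND novelty are high."""
    g = performance_gain(candidate.score, v_known)
    n = novelty(candidate, baselines, judge)
    return g > gain_threshold and n >= novelty_threshold
```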
The benchmark component, iBench, comprises 18 standardized “Improvable Tasks” drawn from complex real-world challenges, such as the ROADEF optimization competition, 2D Bin Packing, and various machine learning engineering problems. Unlike fully “solved problems” or exploratory tasks lacking baselines, these tasks have existing human solutions but significant room for improvement, making them ideal targets for innovative AI.
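For concreteness, an "Improvable Task" can be pictured as a small specification bundling a description, a known baseline score, and an evaluation hook. The structure and example values below (including `ImprovableTask`, `v_known`, and the bin-packing scorer) are hypothetical, not iBench's actual schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ImprovableTask:
    """A task with a known human baseline but clear headroom for improvement."""
    name: str
    description: str
    v_known: float                    # best known (state-of-the-art) score
    evaluate: Callable[[str], float]  # scores a submitted solution artifact
    higher_is_better: bool = True

def score_bin_packing(solution_path: str) -> float:
    # Placeholder: run the submitted packer and measure average packing density.
    raise NotImplementedError

# Hypothetical example entry; the baseline value is illustrative only.
bin_packing_task = ImprovableTask(
    name="2D Bin Packing",
    description="Pack rectangles into fixed-size bins, maximizing density.",
    v_known=0.87,
    evaluate=score_bin_packing,
)
```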
The framework also includes iGym, a unified execution environment providing robust tool management and long-horizon evaluation capabilities necessary to test solutions in complex, multi-step engineering and scientific scenarios.
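The article does not detail iGym's interface, but a unified execution environment of this kind typically exposes a reset/step loop in which the agent acts through tools over many steps before its final artifact is scored against the baseline. Every class and method name in the sketch below (`ExecutionEnv`, `ToolCall`, `final_score`) is an assumed stand-in, not iGym's published API.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    name: str                              # e.g. "run_python", "read_file"
    args: dict[str, Any] = field(default_factory=dict)

class ExecutionEnv:
    """Hypothetical iGym-style environment: the agent acts through tools over
    a long horizon, and the final artifact is evaluated against the baseline."""

    def __init__(self, task, tools: dict[str, Any], max_steps: int = 200):
        self.task = task
        self.tools = tools
        self.max_steps = max_steps

    def reset(self) -> str:
        # Initial observation handed to the agent.
        return self.task.description

    def step(self, call: ToolCall) -> str:
        tool = self.tools[call.name]
        try:
            return str(tool(**call.args))   # observation fed back to the agent
        except Exception as exc:            # robustness matters: surface errors
            return f"TOOL ERROR: {exc}"

    def final_score(self, solution_path: str) -> float:
        return self.task.evaluate(solution_path)
```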
The Robustness Gap
Initial experiments using existing state-of-the-art agent frameworks (such as MLAB, CODEACT, and AIDE) revealed a crucial bottleneck: the “robustness gap.”
Researchers found that while agents could sometimes produce highly novel approaches, their unreliable execution prevented these new ideas from translating into meaningful performance gains. For instance, in tasks like the Circle Packing optimization challenge, agents that were explicitly prompted to prioritize novelty did achieve higher Novelty scores, but this exploration often came at the cost of correctness, resulting in lower Performance Gains.
In one analysis, frameworks achieving mid-to-high novelty on complex tasks still returned some of the lowest performance scores because their novel code or pipeline implementations failed due to errors and lack of robustness.
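This is easiest to see in a scoring harness: a candidate that crashes or times out earns no improvement at all, however original its underlying idea. The sketch below, using assumed names and a simple subprocess runner, shows one way such a fallback might be wired; it is an illustration of the dynamic, not the benchmark's actual harness.

```python
import subprocess
from typing import Callable

def score_with_fallback(run_cmd: list[str],
                        evaluate_output: Callable[[str], float],
                        v_known: float,
                        timeout_s: int = 600) -> float:
    """Score a candidate solution, treating crashes and timeouts as no improvement.

    If the agent's novel code fails to run, the performance gain over V_known
    collapses to zero regardless of how novel the approach was.
    """
    try:
        result = subprocess.run(run_cmd, capture_output=True, text=True,
                                timeout=timeout_s, check=True)
        return evaluate_output(result.stdout)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return v_known  # no gain: fall back to the known baseline score
```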
The results underscore a key takeaway for AI development: simply being creative is not enough. For AI agents to truly drive scientific and engineering progress, their inventive ideas must be correctly and reliably executed to successfully push the state of the art. InnoGym provides the rigorous, dual-axis platform necessary to measure this creative efficacy, setting a high bar for the next generation of AI agents.