AI Agents Struggle to Autonomously Implement Research Extensions

New benchmark reveals current AI agents fall short of independent scientific exploration.

Researchers have developed a new benchmark, REXBENCH, to test the ability of AI agents powered by large language models (LLMs) to autonomously implement extensions to existing AI research. The findings are sobering: even the most advanced agents struggle with these tasks, with success rates remaining below 40% even when given human guidance.

REXBENCH consists of 12 realistic tasks designed to evaluate an agent’s capability to extend prior AI research. Each task provides an AI agent with a research paper, its corresponding codebase, and instructions from domain experts detailing a new hypothesis to test. The goal is for the agent to modify the existing code to implement this new experiment and generate results. For instance, a task might ask an agent to rerun an experiment with a different language model, such as switching from GPT-4o to Llama 3 70B, to see how it affects the outcomes.
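
To make the setup concrete, here is a minimal sketch of what such a task might look like. The `ExtensionTask` structure, its field names, and the `apply_model_swap` helper are illustrative assumptions rather than the benchmark’s actual format; the point is simply that the agent receives a paper, a codebase, and an instruction, and must translate that instruction into working code changes.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ExtensionTask:
    """Hypothetical container for the inputs a REXBENCH-style task gives an agent."""
    paper_pdf: Path      # the original research paper
    codebase_dir: Path   # the paper's accompanying code repository
    instructions: str    # expert-written description of the new hypothesis to test

# An instruction of the kind described above: swap the backbone model and rerun.
task = ExtensionTask(
    paper_pdf=Path("paper.pdf"),
    codebase_dir=Path("original_repo"),
    instructions=(
        "Rerun the main experiment with Llama 3 70B in place of GPT-4o "
        "and report the same evaluation metrics."
    ),
)

def apply_model_swap(config_path: Path) -> None:
    """One concrete edit such a task could require: change the model name in a config file."""
    text = config_path.read_text()
    config_path.write_text(text.replace("gpt-4o", "meta-llama/Meta-Llama-3-70B-Instruct"))
```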

The study evaluated nine different AI agents, combining three popular agent frameworks (aider, Claude Code, and OpenHands) with various LLM backbones. The results indicate a significant gap between the promise of AI in scientific research and the current reality. Across the board, agents performed poorly, with the best-performing combinations achieving only a 25% success rate on average. This means that for three out of every four tasks, the agents failed to correctly implement the research extension.

Even when researchers provided additional hints to guide the agents, success rates improved only modestly. With detailed, step-by-step instructions, the best agents reached 39% success, which still points to a substantial need for human intervention. The researchers noted that the agents often made “explicit errors,” such as failed code execution, or “implicit errors,” where the code ran but produced incorrect results. A common issue was “overthinking”: agents spent excessive reasoning steps and runtime without ultimately producing a usable implementation.
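
The two error categories can be illustrated with a small sketch. The `run_experiment.py` entry point, the `results.json` output file, and the classification function below are assumptions for illustration, not the paper’s actual evaluation harness.

```python
import json
import subprocess
from pathlib import Path

def classify_attempt(repo_dir: Path, expected_results: dict) -> str:
    """Illustrative classification of an agent's attempt into the error categories above."""
    # "Explicit error": the agent's modified code fails to execute at all.
    run = subprocess.run(
        ["python", "run_experiment.py"],   # hypothetical per-task entry point
        cwd=repo_dir, capture_output=True, text=True,
    )
    if run.returncode != 0:
        return "explicit error"

    # "Implicit error": the code runs, but the results it writes do not match expectations.
    produced = json.loads((repo_dir / "results.json").read_text())  # hypothetical output file
    if produced != expected_results:
        return "implicit error"

    return "success"
```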

The REXBENCH benchmark is designed to be robust against data contamination, a common problem in evaluating AI systems. The novel extension tasks and their solutions are kept private, ensuring that the agents are truly solving the problems rather than recalling pre-existing solutions from their training data. This rigorous approach to evaluation reveals the current limitations of AI in autonomously performing the nuanced and complex tasks required for scientific research.

While the current performance is low, the researchers are optimistic. They observed that even in failure, the agents often demonstrated good “file recall,” meaning they could identify the relevant parts of the codebase to modify. This suggests that with further development, AI agents could become valuable tools for scientific exploration, even if they are not yet ready to conduct research independently. REXBENCH aims to serve as a catalyst for developing more capable AI research assistants.
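
For context, “file recall” can be read as the fraction of files edited in the reference solution that the agent also modified. The definition below is an assumption based on that reading, not necessarily the paper’s exact metric.

```python
def file_recall(agent_edited: set[str], reference_edited: set[str]) -> float:
    """Share of files changed by the reference (expert) solution that the agent also changed.

    Assumed definition based on the description above; the benchmark may compute it differently.
    """
    if not reference_edited:
        return 1.0
    return len(agent_edited & reference_edited) / len(reference_edited)

# Example: the agent touches most of the right files but implements the change incorrectly,
# so file recall is high even though the task itself counts as a failure.
print(file_recall({"train.py", "config.yaml"}, {"train.py", "config.yaml", "eval.py"}))  # ≈ 0.67
```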