AI Papers Reader

Personalized digests of latest AI research


New Benchmark Reveals Large Language Models Are Failing at Real-World Code Optimization

Leading AI Agents Achieve Less Than 15% of Expert Performance in Crucial Software Engineering Tasks

A new benchmark, dubbed SWE-FFICIENCY, has exposed significant limitations in the ability of state-of-the-art AI software agents to perform practical performance optimization on major open-source repositories. The research, conducted by engineers from Harvard, Google, and Princeton, found that leading large language models (LLMs) achieved, on average, less than 0.15 times the speedup generated by human experts.

Unlike previous benchmarks that focused on fixing functional bugs, SWE-FFICIENCY directly tests an agent’s capacity for investigative performance engineering. The benchmark consists of 498 real optimization tasks scraped from pull requests (PRs) across nine widely used data science, machine learning, and high-performance computing (HPC) libraries, including numpy, pandas, and scipy.

Each task gives an agent a full codebase and a slow, real-world performance workload (a reproducible script that exhibits the slow behavior). The agent must identify the exact bottleneck, produce a code patch that improves runtime, and, critically, ensure the patch still passes all existing unit tests, a requirement the authors call "pass-to-pass optimization." Performance is measured with the Speedup Ratio (SR), the agent's speedup normalized to that of the expert's patch, so 1.0x represents human parity.

The results show a stark capability gap. Models that perform well on traditional bug-fixing benchmarks, such as GPT-5 and Claude 4.1, struggle to translate those skills into efficiency improvements. Agents also frequently introduce correctness bugs: depending on the model, 15% to over 45% of proposed edits cause existing unit tests to fail.

Qualitative analysis revealed that LLMs primarily fail due to two issues: mislocalization and favoring superficial fixes.

First, agents struggle to pinpoint the true performance bottleneck. The researchers found that in over 68% of cases, the expert's speedup originated in functions the LLM agents never touched at all.

Second, when LLMs do find an opportunity, they exhibit a "satisficing bias," preferring quick, localized shortcuts over deep, systemic algorithmic improvements.

For instance, in a task optimizing a pandas operation, the human expert achieved a 20.5x speedup by restructuring the code to avoid a slow, unnecessary conversion to a generic Python object data type—a fundamental algorithmic change. In contrast, the leading LLM agent attempted an optimization by adding a small caching mechanism for null element checks, achieving only a 2.3x speedup. The LLM found an “easy win” but missed the critical, high-impact algorithmic rewrite.
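The specific pandas pull request is not reproduced here, but the cost of the generic Python object fall-back that the expert eliminated is easy to demonstrate. The snippet below is an illustrative micro-benchmark, not data from the paper:

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000_000
# Native int64 Series: values live in one contiguous NumPy buffer.
native = pd.Series(np.arange(n, dtype="int64"))
# object-dtype Series: the same values, each boxed as a Python int.
boxed = native.astype(object)

# Summing the native Series dispatches to a vectorized C loop;
# summing the object Series falls back to per-element Python calls.
t_native = timeit.timeit(lambda: native.sum(), number=5)
t_boxed = timeit.timeit(lambda: boxed.sum(), number=5)
print(f"object dtype is ~{t_boxed / t_native:.0f}x slower on this machine")
```

Removing an unnecessary conversion of this kind restructures where the work happens, which is why the expert's algorithmic fix dwarfed the agent's localized caching tweak.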

This shortcut tendency, combined with a preference for invasive, difficult-to-maintain edits (such as global monkey-patching), underscores that current LLM agents lack the long-horizon reasoning and deep systems understanding required for autonomous performance engineering. The SWE-FFICIENCY benchmark and its accompanying data pipeline are released publicly to spur research toward closing this critical gap.