Beyond Bug-Fixing: New Benchmark Reveals AI’s Struggle with Complex Software Engineering

For the past year, artificial intelligence has been taking a victory lap in the world of software development. On popular benchmarks like SWE-bench, which tasks AI agents with fixing minor bugs, top-tier models have seen their success rates soar from less than 10% to over 70%. But a new study suggests these models may be more like gifted mechanics than architects.

Researchers from the Chinese Academy of Sciences and Huawei have introduced FeatureBench, a rigorous new evaluation framework designed to test AI agents not on their ability to fix broken code, but on their ability to build entirely new features from scratch. The results are a sobering reality check: the same models that master bug-fixing are hitting a “performance wall” when asked to handle the complexities of real-world feature development.

The Complexity Gap

To understand the difference, imagine the AI as a construction worker. Current benchmarks mostly test whether it can replace a cracked window or fix a leaky pipe: localized tasks that require understanding only a few lines of code.

FeatureBench, however, asks the AI to build an entire sunroom onto an existing house. This requires “feature-level” development: writing hundreds of lines of code across multiple files, ensuring the new room doesn’t collapse the roof, and making sure the electrical wiring integrates perfectly with the existing system.

In one example from the paper, an agent is tasked with implementing a GPT-2 model within a specific library. It isn’t just writing a snippet; it must provide a “directly callable” solution that follows a strict interface, interacts with external tools, and passes a gauntlet of unit tests. When the researchers tested Claude 4.5 Opus, one of the world’s most advanced coding models, on these tasks, its success rate plummeted from 74.4% on bug-fixing to a mere 11% on FeatureBench.
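
To make that contract concrete, here is a rough sketch of the kind of test-driven check such a task implies. The function name, signature, and expected tensor shape below are our own illustration rather than the paper’s actual interface:

```python
# Hypothetical sketch of a FeatureBench-style "directly callable" task contract.
# The function name, signature, and expected output are illustrative assumptions,
# not the actual interface from the paper.
import torch


def build_gpt2(vocab_size: int = 50257, n_layer: int = 12) -> torch.nn.Module:
    """The agent must replace this stub with a working, importable model."""
    raise NotImplementedError


def test_build_gpt2_is_directly_callable():
    model = build_gpt2(vocab_size=1000, n_layer=2)
    tokens = torch.randint(0, 1000, (1, 8))
    logits = model(tokens)               # must run end to end
    assert logits.shape == (1, 8, 1000)  # and honor the expected interface
```

Until the agent replaces the stub with an implementation that runs end to end and satisfies the assertion, the task counts as a failure.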

How FeatureBench Works

The researchers built an automated “test-driven” toolkit that scours 24 major open-source Python repositories, including heavyweights like pandas, transformers, and pytorch-lightning.

The system uses a “dependency graph” to identify specific features, then essentially “hollows out” each one from the codebase and tasks the AI with rebuilding it. Each task environment is packaged in Docker, and the evaluation is entirely execution-based: if the AI’s code doesn’t actually run, or fails even a single test, the task counts as a failure.
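
In spirit, the grading loop is straightforward. The sketch below assumes each task ships a Docker image and a pytest suite; the image name, mount path, and test command are placeholders, not the toolkit’s real internals:

```python
# Minimal sketch of execution-based grading. The Docker image name, mount
# path, and test layout are assumptions for illustration.
import subprocess


def grade_task(image: str, patched_repo: str) -> bool:
    """Return True only if the agent's code runs and every unit test passes."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{patched_repo}:/workspace",  # repo with the agent's changes applied
            image,
            "pytest", "-x", "/workspace/tests",  # -x: stop at the first failure
        ],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0  # any crash or failing test fails the whole task
```

Plausible-looking code that does not actually run earns no credit.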

Why AI Fails: “Idle Habits” and Context Loss

The study highlights two primary reasons for the AI’s failure. The first is contextual blindness. When a feature spans many files, models often lose track of how different parts of the code talk to each other. This leads to frequent NameError failures, where the AI tries to use a function or variable it hasn’t properly imported or defined.
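
As a minimal illustration of this failure mode (our own example, not one taken from the paper):

```python
# utils.py -- existing helper the agent should have read
def normalize(series):
    return (series - series.mean()) / series.std()


# feature.py -- new code written by the agent, which forgot the import
def add_zscore_column(df, column):
    # Raises NameError at runtime: 'normalize' was never imported from utils
    df[f"{column}_z"] = normalize(df[column])
    return df
```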

The second reason is what the researchers call the “Idle Habits” of LLMs. The models exhibited a form of digital “laziness,” often hallucinating or guessing how an existing part of the codebase worked instead of actually “reading” the files to check. This led to TypeError and AttributeError failures, where the AI assumed a tool worked one way but the reality was different.
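
Again, as an illustration of our own rather than a case from the paper:

```python
class FeatureStore:
    """Stand-in for an existing class in the repository."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data[key]


store = FeatureStore()
# The agent guesses that a 'fetch' method exists because similar libraries have one:
value = store.fetch("users")  # AttributeError: 'FeatureStore' object has no attribute 'fetch'
```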

A New North Star

FeatureBench provides 200 high-quality tasks and over 3,800 executable environments. Because the collection process is automated, it can be constantly updated to prevent “data leakage”—the phenomenon where models perform well simply because they were trained on the answers.

For the AI industry, the message is clear: the “easy” era of bug-fixing is over. If AI agents are to become true autonomous collaborators, they must move beyond patching leaks and learn how to build the house.