Beyond the One-Shot Fix: New Benchmark Tests if AI Can Actually Maintain Software
In software development, writing code is the easy part. The real challenge, where an estimated 80% of a project's lifetime cost goes, is maintenance. Yet for years, we have evaluated artificial intelligence using the equivalent of a 100-meter sprint: "Here is a single bug; fix it in one go."
A new research paper from Sun Yat-sen University and Alibaba Group argues that these “snapshot” benchmarks are failing to capture the most critical skill a developer needs: the ability to sustain a codebase over time. To bridge this gap, the researchers have introduced SWE-CI, a benchmark that forces AI agents to evolve software through a long-term Continuous Integration (CI) loop.
The Problem with “Brittle” Fixes
To understand the need for SWE-CI, imagine an AI tasked with fixing a bug in an e-commerce platform’s tax calculator. A “one-shot” AI might simply hard-code a specific tax rate for a single zip code to pass the current failing test. In a traditional benchmark, that AI gets a perfect score.
However, three months later, when the business expands to a new state, that hard-coded “brittle” fix becomes a nightmare, causing new bugs and requiring a total rewrite. “An agent that hard-codes a brittle fix and one that writes clean, extensible code may both pass the same test suite,” the authors write. “Their difference in maintainability is simply invisible [until] the codebase must evolve.”
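The difference is easy to see in code. Here is a hypothetical sketch of the tax-calculator scenario (the function names, zip code, and rates are illustrative, not from the paper): both versions pass a test that checks one zip code, but only one survives the business expanding to new regions.

```python
# Brittle fix: hard-codes the one case the failing test exercises.
def tax_rate_brittle(zip_code: str) -> float:
    if zip_code == "94105":
        return 0.085
    return 0.0  # silently wrong for every other region


# Extensible fix: the same behavior expressed as data, not control flow.
TAX_RATES = {"94105": 0.085}  # a new region is a one-line data change

def tax_rate_extensible(zip_code: str) -> float:
    try:
        return TAX_RATES[zip_code]
    except KeyError:
        raise ValueError(f"no tax rate configured for zip {zip_code}")
```

A test suite that only checks `"94105"` scores both functions identically; the brittle version's cost only shows up when the codebase must evolve, which is exactly what SWE-CI measures.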
How SWE-CI Works
SWE-CI moves away from these static snapshots. Instead, it draws on real-world data: 100 tasks across 68 GitHub repositories. On average, each task covers 233 days of a project's history and 71 consecutive commits.
The benchmark evaluates AI using an Architect-Programmer protocol that mimics a real engineering team:
- The Architect Agent: Analyzes failing tests, identifies the root cause in the source code, and writes a high-level requirement document (the “what”).
- The Programmer Agent: Takes those requirements and implements the actual code changes (the “how”).
This process iterates through dozens of rounds. If the AI’s early decisions are messy or short-sighted, they will “accumulate technical debt,” making it progressively harder to pass tests in later stages of the evolution.
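The loop above can be sketched in a few lines. This is a minimal illustration of the Architect-Programmer protocol as described in the article; `architect`, `programmer`, and `run_tests` are hypothetical stand-ins for the paper's agents and CI harness, not its actual interfaces.

```python
from typing import Callable

def ci_loop(repo: dict,
            commits: list,
            architect: Callable,
            programmer: Callable,
            run_tests: Callable) -> list:
    """Drive the repo through consecutive commits, recording pass/fail."""
    results = []
    for commit in commits:
        failures = run_tests(repo, commit)           # CI surfaces failing tests
        if failures:
            requirements = architect(repo, failures)  # the "what"
            patch = programmer(repo, requirements)    # the "how"
            repo = {**repo, **patch}                  # apply the change
        results.append(not run_tests(repo, commit))   # pass = no failures left
    return results
```

Because the patched `repo` carries forward into every later round, a sloppy early `patch` keeps surfacing as failures downstream, which is how technical debt accumulates in this setting.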
Measuring the “EvoScore”
To capture this, the researchers introduced EvoScore. Unlike standard metrics that give equal weight to every fix, EvoScore uses a “future-weighted” mean. Success in the later stages of a project’s evolution is worth more than success at the beginning. This rewards agents that prioritize long-term stability over “quick and dirty” wins.
The study also tracked the Zero-Regression Rate—the ability of an AI to fix new problems without breaking things that were already working.
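Both metrics are straightforward to express. The sketch below assumes linearly increasing stage weights for EvoScore; the paper's exact weighting scheme may differ, and the data shapes are illustrative.

```python
def evo_score(stage_success: list) -> float:
    """Future-weighted mean: stage i gets weight i+1, so success in
    later evolution stages counts more than early wins."""
    weights = range(1, len(stage_success) + 1)
    return sum(w * s for w, s in zip(weights, stage_success)) / sum(weights)

def zero_regression_rate(rounds: list) -> float:
    """Fraction of rounds in which no previously passing test broke."""
    clean = sum(1 for r in rounds if not r["regressions"])
    return clean / len(rounds)
```

Under this weighting, succeeding only at the final stage of a two-stage task scores higher than succeeding only at the first, which is the intended reward for long-term stability over quick wins.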
The Results: A Long Way to Go
The researchers tested 18 different Large Language Models (LLMs), including industry leaders like GPT-4 and Claude 3. While models are getting better at coding at an “accelerating pace,” the results show a significant weakness in maintenance.
Even the highest performer, the Claude Opus series, struggled with regressions. Most models achieved a zero-regression rate below 0.25, meaning they frequently broke existing features while trying to add new ones.
The takeaway for the industry is clear: while AI is becoming a brilliant “one-shot” coder, we are still in the early days of creating AI “maintainers” capable of managing the long-term health of the software that runs our world.