The "Expert Nitpicker" Test: Why Top AI Coding Agents Falter When Humans Enter the Loop
For all their raw programming muscle, today’s leading artificial intelligence models still struggle with a fundamental human reality: dealing with a boss who changes their mind.
In recent years, AI coding assistants have aced standard industry benchmarks by working in a vacuum. These tests typically hand an AI agent a massive, fully specified instruction manual and let it write code autonomously for hours. But in the real world, software engineering is a messy, conversational sport.
To bridge this gap, researchers at Scale AI have introduced SWE-INTERACT, a new benchmark designed to test how well AI coding agents collaborate with humans over long, unpredictable coding sessions. The results suggest that today’s best models are far from ready to operate as true collaborative teammates.
The Real-World Friction
To understand the problem SWE-INTERACT solves, imagine asking an AI assistant to help you refactor a database. In a traditional benchmark, the AI is given a detailed prompt outlining ten precise technical requirements upfront.
In real life, however, you probably start with a vague Slack message: “Let’s clean up our database schema to match our new files.”
Only after the AI shows you its initial draft do you start nitpicking:
- “The constructor needs to return an error, not just the session object.”
- “Actually, don’t touch those test files, I’ve already handled those.”
- “One last thing: make sure the ID column auto-increments.”
This iterative back-and-forth, sometimes called “vibe coding,” is exactly what SWE-INTERACT simulates. By analyzing thousands of real developer chats, the researchers built a simulated user modeled after a common tech archetype: the Expert Nitpicker. This simulated boss is terse, demanding, and deliberately drip-feeds requirements only after inspecting the agent’s workspace and reviewing its code.
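To make the drip-feed pattern concrete, here is a toy Python sketch. It is not SWE-INTERACT's actual harness: the `SimulatedUser` class, the `run_session` function, and the one-requirement-per-turn rule are all assumptions made for illustration, and the step where the user inspects the agent's workspace is left as a comment.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimulatedUser:
    """Hypothetical stand-in for a reviewer who reveals one requirement per turn."""
    opening_message: str
    hidden_requirements: List[str]
    revealed: List[str] = field(default_factory=list)

    def next_message(self, turn: int) -> Optional[str]:
        # Turn 0 is the vague opening request; nothing specific is revealed yet.
        if turn == 0:
            return self.opening_message
        # Each later turn releases exactly one withheld requirement, in order.
        if self.hidden_requirements:
            requirement = self.hidden_requirements.pop(0)
            self.revealed.append(requirement)
            return requirement
        return None  # Nothing left to ask for, so the session ends.


def run_session(user: SimulatedUser, max_turns: int = 10) -> List[str]:
    """Collect the messages an agent would receive, one per turn."""
    transcript = []
    for turn in range(max_turns):
        message = user.next_message(turn)
        if message is None:
            break
        transcript.append(message)
        # Assumption: a real harness would run the coding agent here and let
        # the simulated user review its workspace before the next turn.
    return transcript


user = SimulatedUser(
    opening_message="Let's clean up our database schema to match our new files.",
    hidden_requirements=[
        "The constructor needs to return an error, not just the session object.",
        "Actually, don't touch those test files, I've already handled those.",
        "One last thing: make sure the ID column auto-increments.",
    ],
)
transcript = run_session(user)
print(len(transcript), "messages over the session")
```

The point of the sketch is that the agent never sees the full requirement list at once: the first message carries none of the three specifics, and each one arrives only on a later turn.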
Success Rates Cut Nearly in Half
When the researchers evaluated top-tier models on SWE-INTERACT, the drop-off in performance was stark.
Frontier models like Claude Opus 4.8 and GPT 5.5, which resolved roughly 50% of coding tasks when given all instructions upfront, saw their success rates fall to roughly 25% to 27% in the interactive, multi-turn environment. The interactive sessions also required three to four times as many computational steps and cost significantly more.
Model Success Rates: Single-Turn vs. Interactive
┌──────────┬──────────────────────┬────────────────────────────┐
│ Model    │ Single-Turn Baseline │ Interactive (SWE-INTERACT) │
├──────────┼──────────────────────┼────────────────────────────┤
│ Opus 4.8 │ 50.7%                │ 26.7%                      │
│ GPT 5.5  │ 48.0%                │ 24.7%                      │
└──────────┴──────────────────────┴────────────────────────────┘
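A drop like this can be read two ways: in percentage points or as a share of the original success rate. The short Python snippet below works out both from the four figures in the table. The `drop` helper is a name chosen here for illustration.

```python
# Success rates (percent) taken from the table: (single-turn, interactive).
results = {
    "Opus 4.8": (50.7, 26.7),
    "GPT 5.5": (48.0, 24.7),
}


def drop(single_turn: float, interactive: float):
    """Return (absolute drop in points, relative drop in percent), rounded to 0.1."""
    absolute = single_turn - interactive
    relative = absolute / single_turn * 100
    return round(absolute, 1), round(relative, 1)


for model, (single_turn, interactive) in results.items():
    points, percent = drop(single_turn, interactive)
    print(f"{model}: {points} points lower, a {percent}% relative drop")
```

For these figures the gap is 24.0 points for Opus 4.8 and 23.3 points for GPT 5.5, which corresponds to relative drops of about 47.3% and 48.5%. In other words, each model lost close to half of its single-turn success rate.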
The study identified “forgotten requirements” as one of the most common failure modes. Because the coding session happens over multiple turns, AI agents suffer from a form of short-term amnesia. For example, an agent might successfully implement a crucial API exception handler in turn two. But by turn five, while rewriting a different part of the code to satisfy a new request, the same agent might completely overwrite or delete its own earlier work.
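One way to picture this failure mode is as a regression check: after every turn, re-test all requirements stated so far, not just the newest one. The Python sketch below illustrates the idea under loud assumptions. It is not how SWE-INTERACT grades agents. The `Workspace` and `Check` types, the `find_forgotten` helper, the file names, and the crude substring checks are all hypothetical.

```python
from typing import Callable, Dict, List

Workspace = Dict[str, str]           # hypothetical: file path -> file contents
Check = Callable[[Workspace], bool]  # hypothetical: one predicate per requirement


def find_forgotten(checks_so_far: Dict[str, Check], workspace: Workspace) -> List[str]:
    """Return the names of previously stated requirements that no longer hold."""
    return [name for name, check in checks_so_far.items() if not check(workspace)]


# Turn 2: the user asks for an API exception handler, and the agent adds one.
checks: Dict[str, Check] = {
    "handles_api_error": lambda ws: "except ApiError" in ws.get("client.py", ""),
}
workspace_turn2 = {"client.py": "try:\n    call()\nexcept ApiError:\n    retry()\n"}
forgotten_turn2 = find_forgotten(checks, workspace_turn2)

# Turn 5: a new requirement arrives. While satisfying it, the agent rewrites
# client.py and silently drops the handler it wrote in turn 2.
checks["id_autoincrement"] = lambda ws: "AUTOINCREMENT" in ws.get("schema.sql", "")
workspace_turn5 = {
    "client.py": "call()\n",
    "schema.sql": "id INTEGER PRIMARY KEY AUTOINCREMENT",
}
forgotten_turn5 = find_forgotten(checks, workspace_turn5)

print("Forgotten after turn 2:", forgotten_turn2)
print("Forgotten after turn 5:", forgotten_turn5)
```

In this toy run nothing is forgotten after turn two, but after turn five the check for the exception handler fails even though the newest requirement passes. That is the pattern the study describes: the latest request is satisfied at the expense of an earlier one.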
Collaboration is a Distinct Skill
The findings from SWE-INTERACT suggest that building a great AI developer isn’t just about adding more raw coding horsepower. Software engineering is as much about communication, clarifying ambiguity, and iteratively building on feedback as it is about syntax.
By reframing benchmark difficulty around human interaction rather than just task complexity, SWE-INTERACT highlights a critical, under-measured capability axis. If AI agents are to become true co-pilots, they must learn to survive the gauntlet of the human feedback loop.