Who Wrote the Code? Why AI "Hands" Matter Just as Much as Their "Brains"
Large language models are rapidly evolving from simple chatbots into autonomous software engineers. But when an AI successfully fixes a complex bug on GitHub, what deserves the credit? Is it the intelligence of the underlying AI model, or the software “harness” that translates its thoughts into actual keyboard strokes and terminal commands?
Until now, popular benchmarks like SWE-bench conflated these two forces. A new paper by researchers from top institutions, including Tsinghua University and Peking University, introduces Claw-SWE-Bench—a benchmark designed to isolate, control, and evaluate the “harness” (or what they call the “claw”) as an independent variable.
To understand why this matters, imagine two world-class novelists. If you give one a modern laptop and the other a broken typewriter with missing keys, their output will look vastly different despite having the same creative brain. In the AI world, the “claw” is that interface. It manages how the AI reads files, runs tests, and edits code.
The researchers showed that the design of this claw is a first-order factor in success. With the GLM 5.1 model, a basic “bare” adapter, which simply asks the AI to output a standard code patch (a “diff” file) as text, achieved a dismal 19.1% success rate. The problem wasn’t the AI’s coding logic but the fragility of the text format: a single misplaced newline or incorrect line number in a raw patch causes the whole patch to be rejected.
By contrast, when the researchers used a full, sophisticated adapter, the success rate jumped to 73.4% on the same model. This full adapter acts like an invisible assistant: instead of forcing the AI to write raw patch syntax by hand, it lets the AI edit the project files directly and then uses Git behind the scenes to package and export the final patch.
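A minimal sketch of that idea, under stated assumptions: the article says the full adapter uses Git to export the patch, whereas this example substitutes Python's standard `difflib` module so it runs without a repository. The function name `export_patch` and the file `calc.py` are hypothetical and are not taken from the paper. The point is only that the patch is computed mechanically from the edited file, so the model never has to get hunk headers or line numbers right.

```python
import difflib


def export_patch(path: str, before: str, after: str) -> str:
    """Build a unified diff from a file's contents before and after an edit.

    Hypothetical helper: stands in for the Git-based export described
    in the article; it is not the benchmark's actual code.
    """
    diff_lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff_lines)


# The model only edits the file; it never writes diff syntax itself.
before = "def add(a, b):\n    return a - b\n"
after = "def add(a, b):\n    return a + b\n"

patch = export_patch("calc.py", before, after)
print(patch)
```

Because the hunk header and line offsets come from the tool rather than from generated text, the formatting failure described for the bare adapter cannot occur in this step.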
Furthermore, Claw-SWE-Bench exposes a hidden dimension of AI development: financial cost. Some AI setups are “token gluttons,” repeatedly querying expensive models and running up large bills. For example, running the flagship GPT-5.5 model across the benchmark’s 350 real-world GitHub issues resolved 78% of tasks but cost $1,399. Meanwhile, DeepSeek-V4 Flash solved 70.3% of the tasks for just $8.20: roughly 170 times cheaper, or about two orders of magnitude, for a drop of 7.7 percentage points in resolved tasks.
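The arithmetic behind that comparison can be checked with a short calculation. This is an illustrative sketch that uses only the figures quoted above; the resolved-task counts are derived by rounding the reported percentages against 350 issues, which is an assumption, and the variable names are made up for the example.

```python
TOTAL_ISSUES = 350

# (resolve rate, total cost in USD) as quoted in the article.
runs = {
    "GPT-5.5": (0.78, 1399.00),
    "DeepSeek-V4 Flash": (0.703, 8.20),
}

cost_per_resolved = {}
for name, (rate, cost) in runs.items():
    # Assumption: resolved count is the reported rate rounded to whole tasks.
    resolved = round(rate * TOTAL_ISSUES)
    cost_per_resolved[name] = cost / resolved
    print(f"{name}: {resolved} resolved, ${cost / resolved:.2f} per resolved task")

# How many times more the expensive run cost in total.
cost_ratio = runs["GPT-5.5"][1] / runs["DeepSeek-V4 Flash"][1]
print(f"Total cost ratio: {cost_ratio:.0f}x")
```

Under these assumptions, the gap is wider per resolved task than the headline resolve rates suggest, because the cheaper model gives up only a small share of solved issues.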
To make these evaluations accessible to smaller teams, the researchers also released Claw-SWE-Bench Lite, an 80-instance subset. Rather than picking easy tasks, they used a rank-aware selection algorithm to ensure this cheaper subset accurately mirrors the difficulty, language diversity, and cost structures of the larger benchmark at a fraction of the price.
As AI agents become a staple of modern software development, Claw-SWE-Bench delivers a vital reality check. Assessing an AI’s coding capability is no longer just about choosing the smartest model—it is about designing the most efficient, cost-effective hands to let that brain work.