CodeClash: New Benchmark Reveals AI Programmers Are Creative, but Still Lack Strategy and Cleanliness
A new competitive benchmark called CodeClash is challenging Large Language Models (LLMs) to evolve code over multiple rounds to achieve high-level goals, mimicking the long-term, adversarial nature of real-world software engineering.
While previous benchmarks tested LLMs on discrete tasks—like fixing a single bug or implementing an isolated function—CodeClash forces models to pursue complex, open-ended objectives, such as maximizing profit or ensuring survival against evolving opponents.
The results, drawn from 1,680 tournaments involving 8 frontier LLMs across 6 distinct competitive “code arenas,” show that while modern models are creative developers, they still suffer from fundamental limitations in strategic reasoning and long-term codebase maintenance.
The Adversarial Gauntlet
CodeClash operates as a multi-round competition. In each round, LLMs, acting as autonomous software agents, enter an Edit Phase where they modify their codebase using command-line instructions. They might write new analysis scripts, debug errors, or develop novel algorithms. This is followed by a Competition Phase, where the modified codebases compete head-to-head.
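The alternating Edit/Competition structure can be sketched as a simple loop. This is an illustrative toy, not the benchmark's actual API: the `Agent`, `ToyArena`, and `run_tournament` names are hypothetical, and the real Edit Phase would invoke an LLM agent rather than the stub shown here.

```python
# Hypothetical sketch of CodeClash's two-phase round loop.
# All class and method names are illustrative, not the benchmark's real API.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    codebase: dict = field(default_factory=dict)   # filename -> source text
    feedback: str = ""                             # logs from the last round

class ToyArena:
    """Stand-in arena: 'competes' codebases by total source length."""
    def edit(self, codebase, feedback):
        # A real agent would run an LLM with shell access here;
        # the toy just adds one new file per round.
        new = dict(codebase)
        new[f"strategy_{len(new)}.py"] = "pass"
        return new

    def compete(self, codebases):
        scores = [sum(len(src) for src in cb.values()) for cb in codebases]
        return [f"score={s}" for s in scores]

def run_tournament(agents, arena, n_rounds=15):
    for _ in range(n_rounds):
        # Edit Phase: each agent revises its code, seeing only its own logs.
        for agent in agents:
            agent.codebase = arena.edit(agent.codebase, agent.feedback)
        # Competition Phase: modified codebases compete head-to-head; the
        # resulting logs are the sole feedback for the next round.
        logs = arena.compete([a.codebase for a in agents])
        for agent, log in zip(agents, logs):
            agent.feedback = log
    return agents

agents = run_tournament([Agent("a"), Agent("b")], ToyArena(), n_rounds=3)
```

The key design point the sketch captures: agents never see an oracle reward signal, only the raw competition logs, which they must interpret themselves to plan the next round's edits.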
For example, in the Battlesnake arena, LLMs must program a snake to survive on a grid, avoid collisions, and acquire resources, aiming to be the last snake standing. In the Poker arena, the goal is maximizing chips in a game of No-Limit Texas Hold’em. The output logs from the Competition Phase are fed back into the agent’s codebase as the sole source of feedback for the next round’s strategic planning.
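To give a flavor of the kind of logic a Battlesnake competitor must write, here is a toy move filter that avoids walls and the snake's own body. This is purely illustrative; the function name and grid representation are assumptions, not CodeClash's actual arena interface.

```python
# Toy Battlesnake-style safety check (illustrative only).
# The board is a width x height grid; the snake dies if it leaves the
# grid or moves into an occupied cell.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def safe_moves(head, body, width, height):
    """Return the moves from `head` that stay on the grid and off `body`."""
    safe = []
    for name, (dx, dy) in MOVES.items():
        x, y = head[0] + dx, head[1] + dy
        if 0 <= x < width and 0 <= y < height and (x, y) not in body:
            safe.append(name)
    return safe

# Snake in the bottom-left corner with its body one cell to the right:
# only "up" keeps it alive.
moves = safe_moves(head=(0, 0), body={(1, 0)}, width=5, height=5)
```

A competitive bot layers strategy (resource acquisition, opponent prediction) on top of such basic survival checks, which is precisely the open-ended objective that distinguishes the arena from a single bug-fix task.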
The benchmark’s Elo ratings crowned Claude Sonnet 4.5 (1389) as the overall top performer, followed closely by GPT-5 (1360) and o3 (1343). However, no single model dominated all six distinct arenas, highlighting the specialized nature of the challenge.
The Messy Codebase Problem
The multi-round structure exposed a critical weakness: LLMs are terrible at code hygiene and long-term maintenance. Researchers found that repositories managed by the models became progressively messy and redundant over time.
Instead of refining and reusing core scripts, models tended to generate new, single-use analysis files or temporary testing tools for each round. Claude Sonnet 4.5, for instance, created an average of 18 “throwaway files” per tournament, leading to a sprawling, disorganized codebase. The total number of files created grew almost linearly across the 15 rounds, a trend contrasting sharply with how human engineers typically maintain projects.
Strategic and Validation Failures
Beyond cleanliness, LLMs demonstrated poor strategic thinking. The study revealed that even top models struggled to correctly interpret detailed competition logs to diagnose failures and adapt.
Often, models were observed to hallucinate causal explanations for a loss—inferring why a game was lost by merely reading the opening lines of a log file, even if those lines contained no relevant information. Moreover, most models deployed untested code. Only Claude Sonnet 4.5 and GPT-5 validated their changes in a majority of rounds, either through arena simulations or unit tests. The others routinely pushed changes without confirming they improved performance or even worked as intended.
The gap between LLM and human performance remains stark. When Claude Sonnet 4.5—the highest-ranked model—was pitted against an expert human-written bot named “gigachad” in the RobotRumble arena, the model was dominated, failing to win a single round out of 150 simulations.
CodeClash’s open-source release aims to provide a robust, competitive training environment to advance the study of autonomous, goal-oriented software development, pushing future LLM agents beyond basic bug fixes toward true strategic engineering.