Can AI Build the Next Great Video Game? New Benchmark Tests "Agentic" Limits
In the world of artificial intelligence, large language models (LLMs) have already proven they can write decent Python scripts and troubleshoot web code. But can they build a video game? According to a new research paper from Carnegie Mellon and Princeton Universities, the “final boss” of AI coding has arrived, and current models are still struggling to beat the first level.
The researchers have introduced GameDevBench, the first comprehensive benchmark designed to evaluate “agentic” capabilities through the lens of game development. Unlike previous benchmarks that focus on isolated snippets of text-based code, GameDevBench requires AI agents to navigate the messy, multimodal reality of a modern game engine—specifically, the open-source Godot engine.
The Complexity of the “Plumber Problem”
To understand why game development is so difficult for AI, consider the “Italian plumber” example cited in the paper. If a human developer wants to create a character for a platformer, they don’t just write a movement script. They must create animations for specific states (idling, jumping, running), set up “colliders” so the character doesn’t fall through the floor, and link sound effects to specific actions.
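To get a feel for how many moving parts even a “simple” character involves, here is a minimal, engine-agnostic Python sketch. It is not the paper’s code and not Godot’s real scripting API; the class names, file names, and the idea of a “missing pieces” check are purely illustrative assumptions.

```python
# An engine-agnostic sketch (not the paper's code, not Godot's API) of the
# pieces that all have to line up for a single platformer character.
from dataclasses import dataclass, field
from enum import Enum, auto


class CharacterState(Enum):
    IDLE = auto()
    RUN = auto()
    JUMP = auto()


@dataclass
class PlayerSetup:
    # Each state needs its own animation clip...
    animations: dict[CharacterState, str] = field(default_factory=dict)
    # ...a collider so the character doesn't fall through the floor...
    collider_size: tuple[int, int] = (16, 32)  # width x height in pixels
    # ...and sound effects hooked to specific actions.
    sounds: dict[str, str] = field(default_factory=dict)

    def missing_pieces(self) -> list[str]:
        """Anything the developer (or agent) forgot to wire up shows up here."""
        gaps = [f"animation for {state.name}" for state in CharacterState
                if state not in self.animations]
        gaps += [f"sound for {action}" for action in ("jump", "land")
                 if action not in self.sounds]
        return gaps


if __name__ == "__main__":
    setup = PlayerSetup(animations={CharacterState.IDLE: "idle.tres"})
    print(setup.missing_pieces())
    # ['animation for RUN', 'animation for JUMP', 'sound for jump', 'sound for land']
```

The point of the toy example: forgetting any single item produces a character that compiles but plays wrong, which is exactly the kind of failure a text-only agent struggles to notice.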
GameDevBench consists of 132 of these complex tasks, curated from real-world YouTube and web tutorials. These aren’t just coding puzzles; they are architectural challenges. On average, solving a task in this benchmark requires three times as many lines of code and file changes as a task in SWE-Bench, the current industry standard for evaluating software engineering AI.
Visual Intuition vs. Logical Code
The study highlights a massive gap in AI’s “multimodal” understanding—the ability to process text and images simultaneously. For example, a task might ask an agent to “add a walking animation using the provided spritesheet.” A spritesheet is a single image containing dozens of tiny drawings of a character in various poses.
A human developer looks at the sheet and intuitively knows which frames are “walking” and which are “attacking.” AI agents, however, frequently stumble here. The researchers found that while agents were relatively successful at “Gameplay Logic” (46.9% success rate), their success rate plummeted to just 31.6% on tasks involving 2D graphics and animation. They often picked the wrong sprites or placed game elements (nodes) in the wrong part of the scene hierarchy.
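To see where that gap comes from, consider what slicing a spritesheet looks like in code. The sketch below is not the benchmark’s code; it uses Pillow and assumes a regular grid of 64x64 frames. The mechanical part is trivial. What the code cannot do is say which slice belongs to the walk cycle.

```python
# A sketch (not the paper's code) of spritesheet slicing with Pillow.
# The 64x64 frame size and the 8x4 grid below are assumptions for illustration.
from PIL import Image

FRAME_W, FRAME_H = 64, 64


def slice_spritesheet(sheet: Image.Image) -> list[Image.Image]:
    """Cut a sheet into equal-sized frames, left to right, top to bottom."""
    cols = sheet.width // FRAME_W
    rows = sheet.height // FRAME_H
    return [
        sheet.crop((c * FRAME_W, r * FRAME_H, (c + 1) * FRAME_W, (r + 1) * FRAME_H))
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    # Stand-in for a real asset: a blank 8-column by 4-row sheet.
    sheet = Image.new("RGBA", (8 * FRAME_W, 4 * FRAME_H))
    frames = slice_spritesheet(sheet)
    print(f"{len(frames)} frames extracted")
    # Nothing here tells the agent that, say, frames 8-15 are the walk cycle
    # and frames 16-23 are the attack animation. A human reads that off the
    # image; a text-only agent is left guessing at indices.
```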
Giving AI “Eyes”
The researchers tested a “who’s who” of frontier models, including the Claude 4.5 family, Gemini 3, and GPT variants. Even the best-performing agent—Gemini 3 Pro—only managed to solve 54.5% of the tasks.
To help the agents, the team introduced two simple feedback mechanisms: screenshots and video. When the AI was allowed to take a screenshot of the Godot editor or watch a short clip of the game running, performance spiked dramatically. Most notably, Claude Sonnet 4.5 saw its success rate jump from 33.3% to 47.7% simply by being allowed to “see” the results of its work.
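The paper’s harness isn’t reproduced here, but the loop it describes might look roughly like the following sketch, where ask_model, apply_patch, and capture_screenshot are hypothetical placeholders rather than real Godot or LLM APIs.

```python
# A hedged sketch of a visual-feedback agent loop. All helper functions are
# hypothetical stand-ins, not the benchmark's actual harness or any real API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepResult:
    patch: str   # project edits proposed by the agent
    done: bool   # whether the agent believes the task is finished


def ask_model(task: str, screenshot: Optional[bytes]) -> StepResult:
    """Stand-in for a multimodal LLM call (task text plus an optional image)."""
    return StepResult(patch="", done=screenshot is not None)


def apply_patch(patch: str) -> None:
    """Stand-in for writing the agent's edits into the Godot project."""


def capture_screenshot() -> bytes:
    """Stand-in for rendering the current scene and returning a PNG."""
    return b""


def run_task(task: str, max_steps: int = 10) -> bool:
    screenshot: Optional[bytes] = None
    for _ in range(max_steps):
        result = ask_model(task, screenshot)
        apply_patch(result.patch)
        if result.done:
            return True
        # The key difference from a text-only agent: before the next attempt,
        # the agent gets to "see" what its edits actually produced.
        screenshot = capture_screenshot()
    return False


if __name__ == "__main__":
    print(run_task("add a walking animation using the provided spritesheet"))
```

The design choice worth noting is how cheap the intervention is: the model itself is unchanged, and the only addition is an image (or short video) fed back into the next step of the loop.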
Why This Matters
The release of GameDevBench marks a shift in how we measure AI. We are moving away from models that simply “know things” and toward agents that can “do things” in complex, multi-layered environments. Game development is an ideal testbed because it sits at the intersection of creative expression and rigid software logic.
While we aren’t at the point where you can ask an AI to “make the next Elden Ring” from scratch, GameDevBench provides the roadmap for building agents that can actually understand the visual and spatial world they are coding. For now, the “Game Over” screen is still appearing for AI more often than not, but with visual feedback, the high score is slowly rising.