AI Coding Agents Face a New Reality Check: The Vision2Web Benchmark
The dream of autonomous “AI software engineers” has moved a step closer to reality, but a new research paper reveals that even the most advanced models still stumble when faced with the messy, multi-layered reality of building a modern website.
Researchers from Tsinghua University and Zhipu AI have introduced Vision2Web, a rigorous new benchmark designed to move beyond simple code snippets and test AI agents on their ability to build entire web systems from visual designs. Their findings, recently published, suggest that while AI is becoming a master of “look and feel,” it remains a novice at the “plumbing” and long-term planning required for full-stack development.
The Three Rungs of Web Mastery
Existing tests for AI coding agents often focus on fixing small bugs or generating single blocks of code. Vision2Web changes the game by using a “hierarchical” approach, breaking the development process into three increasingly difficult levels:
1. Static Webpage Generation: Can the AI look at screenshots of a page as it appears on desktop, tablet, and smartphone, and write code that reproduces each layout faithfully?
   - Intuition: Imagine showing an AI a picture of a sleek Nike landing page and asking it to write the HTML/CSS so it looks identical across all devices.
2. Interactive Frontend Development: Can the AI handle multiple pages and the links between them?
   - Intuition: If you show the AI a “Services” page and a “Contact Us” page, it must write the logic so that clicking a button on one actually takes you to the other while keeping the design consistent.
3. Full-Stack Website Construction: This is the ultimate stress test. The AI is given a high-level “Product Requirement Document” (PRD) and must build the frontend, the backend database, and the user authentication.
   - Intuition: The AI is asked to build a community forum similar to Airbnb’s. It must ensure users can log in, create posts, upload images, and see their own profiles, all while matching a specific visual style.
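To make the hierarchy concrete, the three levels can be sketched as a small data model an evaluation harness might use. The class names, field names, and input/output descriptions below are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TaskLevel:
    """One rung of the Vision2Web hierarchy (hypothetical representation)."""
    level: int                  # 1 = static, 2 = interactive, 3 = full-stack
    name: str
    inputs: list                # what the agent is given
    required_outputs: list      # what the agent must produce

VISION2WEB_LEVELS = [
    TaskLevel(1, "Static Webpage Generation",
              inputs=["desktop screenshot", "tablet screenshot", "mobile screenshot"],
              required_outputs=["HTML/CSS reproducing each layout"]),
    TaskLevel(2, "Interactive Frontend Development",
              inputs=["screenshots of multiple linked pages"],
              required_outputs=["pages plus working navigation between them"]),
    TaskLevel(3, "Full-Stack Website Construction",
              inputs=["Product Requirement Document (PRD)", "visual style references"],
              required_outputs=["frontend", "backend database", "user authentication"]),
]

def hardest_level(levels):
    """Return the most complex rung (highest level number)."""
    return max(levels, key=lambda t: t.level)
```

Each rung strictly contains the previous one's demands: a Level 3 agent still has to satisfy the visual-fidelity requirements of Level 1.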
Grading the Robot’s Homework
Evaluating a full website is notoriously difficult. To solve this, the researchers created a “Workflow-Based Agent Verification” system. Instead of just looking at the code, they use two specialized AI “judges” to test the final product.
First, a GUI Agent Verifier acts like a human QA tester, clicking buttons and filling out forms to see if the site actually works. Second, a VLM-based Judge (a Vision-Language Model) compares the AI’s creation to the original design prototypes, scoring it on visual fidelity. If the AI’s “Submit” button is bright red when the prototype was navy blue, the judge docks points.
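One way to picture the two-judge setup is as a composite score: a functional pass rate from the GUI Agent Verifier combined with a visual-fidelity score from the VLM judge. The function names and the 50/50 weighting below are assumptions for illustration; the paper's actual metric may be defined differently:

```python
# Illustrative sketch of workflow-based verification: combine a GUI
# agent's functional checks with a VLM judge's visual-fidelity score.
# All names and the default 50/50 weighting are assumptions.

def functional_score(check_results):
    """Fraction of GUI checks (clicks, form submissions, etc.) that passed."""
    if not check_results:
        return 0.0
    return sum(1 for passed in check_results if passed) / len(check_results)

def verify_site(gui_checks, visual_fidelity, w_functional=0.5, w_visual=0.5):
    """Composite verification score in [0, 1].

    gui_checks: list of booleans from the GUI Agent Verifier
                (e.g. "clicking 'Submit' actually stores the post").
    visual_fidelity: score in [0, 1] from the VLM judge comparing the
                     built site against the design prototype.
    """
    return w_functional * functional_score(gui_checks) + w_visual * visual_fidelity

# Example: 3 of 4 interactions work, visual match judged at 0.9.
score = verify_site([True, True, True, False], 0.9)  # → 0.825
```

The point of the split is that neither judge alone suffices: a pixel-perfect site with a dead “Submit” button fails the GUI checks, and a fully functional site in the wrong colors loses visual-fidelity points.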
The “Complexity Gap”
The results were a wake-up call for the industry. While top-tier models like Claude-Opus-4.5 showed impressive performance on static layouts (Level 1), their success rates plummeted as they moved toward full-stack tasks (Level 3).
The study found that AI agents struggle with “long-horizon planning.” For example, when building a complex forum, an AI might successfully build the login screen but fail to connect it to the database, or it might “forget” the visual style when moving from the homepage to a sub-page.
“Agent performance degrades consistently as task complexity increases,” the researchers noted, highlighting that “system-level planning” remains a major bottleneck. For now, while AI can help sketch the front door of your digital house, it still needs human supervision to make sure the lights actually turn on.