Beyond Pixels: New Benchmark Reveals Why AI-Generated Web Pages Look Perfect But Still Fail

AI models have become remarkably good at coding beautiful web pages from simple sketches or text prompts. However, a stunning layout often hides a dysfunctional user experience. A new paper by researchers from Tsinghua University and Huawei Noah’s Ark Lab introduces WebRISE, a benchmark designed to test if AI-generated websites actually work when users start clicking around.

To understand the problem, imagine ordering a shirt online. You decide to remove it from your digital shopping cart. On a poorly coded AI-generated page, clicking “remove” might uncheck the box, but the total price at the bottom remains unchanged, and the checkout button stays active. Visually, the page looks fine, but functionally, the underlying “state” of the application is broken.
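A minimal Python sketch of that failure follows. All names here (`Cart`, `BuggyCartView`, prices stored in cents) are illustrative assumptions, not taken from the paper. The first class derives the total and the checkout state from the item list on every read. The second keeps cached copies that its remove handler forgets to refresh, which is roughly the kind of bug described above.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    """Hypothetical cart model: maps item name to price in cents."""
    items: dict = field(default_factory=dict)

    @property
    def total_cents(self) -> int:
        # Recomputed from the items on every read, so it cannot go stale.
        return sum(self.items.values())

    @property
    def checkout_enabled(self) -> bool:
        # The checkout button is only active while the cart holds something.
        return len(self.items) > 0

    def remove(self, name: str) -> None:
        self.items.pop(name, None)


@dataclass
class BuggyCartView:
    """Mimics the broken page: total and button state are cached copies."""
    items: dict
    total_cents: int = 0
    checkout_enabled: bool = False

    def __post_init__(self) -> None:
        # Cached once at construction time.
        self.total_cents = sum(self.items.values())
        self.checkout_enabled = bool(self.items)

    def remove(self, name: str) -> None:
        # Bug: the item disappears, but the cached total and the
        # checkout flag are never refreshed.
        self.items.pop(name, None)
```

After removing the only item, `Cart` reports a total of 0 and a disabled checkout, while `BuggyCartView` still shows the old total and an active button even though its item list is empty.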

Traditional evaluation tools typically look at static screenshots or simple, single-action tests. WebRISE changes this by using Interaction Contract Graphs (ICGs). Instead of just checking if a button exists, WebRISE simulates real user pathways—like adding items, applying filters, and navigating pages—to verify that the website’s underlying logic adapts correctly at every step.
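This article does not spell out the format of an Interaction Contract Graph, so the following is only an illustrative sketch of the general idea, not the benchmark's implementation. It assumes a hypothetical `Transition` type that pairs a simulated user action with a contract, meaning a condition that must hold on the page state afterwards. For simplicity it walks a single linear path rather than a full graph.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transition:
    """Hypothetical edge: a user action plus the condition it must satisfy."""
    name: str
    action: Callable[[dict], None]    # simulated user action, mutates the state
    contract: Callable[[dict], bool]  # what must be true once the action ran


def run_path(state: dict, path: list) -> list:
    """Apply each transition in order and record whether its contract held."""
    results = []
    for transition in path:
        transition.action(state)
        results.append((transition.name, transition.contract(state)))
    return results


def pass_rate(results: list) -> float:
    """Fraction of transitions whose contract held (0.0 for an empty path)."""
    if not results:
        return 0.0
    return sum(ok for _, ok in results) / len(results)


# A toy "page" whose remove handler forgets to update the total.
def add_shirt(state: dict) -> None:
    state["items"].append(2500)
    state["total"] = sum(state["items"])


def remove_shirt_buggy(state: dict) -> None:
    state["items"].remove(2500)  # bug: state["total"] is left untouched


def total_matches_items(state: dict) -> bool:
    return state["total"] == sum(state["items"])


path = [
    Transition("add shirt", add_shirt, total_matches_items),
    Transition("remove shirt", remove_shirt_buggy, total_matches_items),
]
results = run_path({"items": [], "total": 0}, path)
# results == [("add shirt", True), ("remove shirt", False)]
# pass_rate(results) == 0.5
```

Checking the contract after every step, rather than once at the end, is what makes the second transition fail here even though the page would still render without errors.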

The researchers tested 14 prominent AI models, including OpenAI’s GPT-5.5 and Google’s Gemini systems, across more than 400 web development tasks. The models were given instructions in various formats, including text, sketches, and short interaction videos.

The results reveal that interactive web generation is still far from solved. Even the strongest model, GPT-5.5, completed only 65.6% of the required interactive transitions. Crucially, the study confirmed that a page’s visual appeal is a poor indicator of its functionality. One tested model, Qwen3.6-35B-A3B, scored 80.8 out of 100 on visual quality when generating pages from Markdown, yet managed just 15.5 on interaction correctness.

Interestingly, the study found that showing the AI a video demonstration of how the page should behave yielded the best results, improving functional accuracy by more than 10 percentage points compared to text-only instructions. Still, “implicit” requirements—like displaying a loading spinner during a slow search or keeping a draft saved when a user accidentally refreshes—remain a massive hurdle for even the best models.

By shifting the focus from how a webpage looks to how it behaves under pressure, WebRISE provides a much-needed reality check for the automated software development industry.
