AI’s Creative Wall: New "CutVerse" Benchmark Reveals Why Virtual Assistants Fail at Video Editing


If you have ever watched an AI-generated video clip, you know the technology is impressive but incomplete. Currently, creating a polished, coherent video requires a human to take various short clips and manually piece them together using professional editing software. Why can’t an AI agent do this “last mile” of the work?

A new study by researchers from the Communication University of China, the National University of Singapore, and USEIT AI introduces CutVerse, the first benchmark designed to test autonomous AI agents in professional media post-production environments. Their findings reveal a stark reality: while AI assistants excel at browsing the web or filling in spreadsheets, they hit a hard wall when handed the keys to professional creative suites like Adobe Premiere Pro, After Effects, and DaVinci Resolve.

To illustrate the complexity, the researchers propose a workflow they call “Vibe Cutting.” Imagine telling an AI, “Help me create a three-minute video about ‘Dog vs. Godzilla’.” To accomplish this, the agent must first use a generative model like Keling to create the raw video clips. Then, it has to open DaVinci Resolve to color-grade the footage, import those clips into Premiere Pro, align them on a multi-track timeline, precisely sync an audio track to the video, and apply transition effects.

While humans navigate these steps fluidly, AI agents fail spectacularly. The CutVerse benchmark—which features 186 complex, real-world tasks across seven professional applications—revealed that even the most advanced vision-language models, including Anthropic’s Claude and Google’s Gemini, achieve a dismal 36% average success rate on core editing tasks.

The researchers identified several critical bottlenecks preventing AI from taking over the editing bay:

Dense Layouts and “Blind Spots”: Unlike simple web interfaces with large, labeled buttons, creative software is incredibly compact. In After Effects, for instance, an agent tasked with using the “RotoBrush” tool to isolate a cartoon character might fail simply because the icon is a tiny, unlabeled brush nestled among dozens of near-identical tools.
The Need for “Compositional” Action: Many editing tasks require physical coordination. For example, to trim a clip precisely on a timeline, a human editor might hold the Shift key while dragging the mouse. Current AI agents struggle to execute these synchronized keyboard-and-mouse actions, which cannot be broken down into simple, independent clicks.
The Milestone Gap and Repetitive Loops: An agent might successfully perform nine out of ten steps in a video-trimming pipeline, but a single pixel-level error early on can cascade through the later steps and cause the whole task to fail. Furthermore, if an action produces no obvious visual change on screen, the agent often gets trapped in a repetitive loop, clicking the same coordinate again and again.
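
To make the last two bottlenecks concrete, here is a minimal Python sketch. It is an illustration only and is not taken from CutVerse: the names `ModifiedDrag` and `RepeatGuard`, the event labels, and the idea of comparing screenshot hashes are all assumptions made for this example. The first class shows why a Shift-drag is one ordered unit rather than a set of independent clicks; the second shows one simple way an agent harness might notice that it keeps repeating an action while the screen stays the same.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Hashable, List, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class ModifiedDrag:
    """A mouse drag performed while a modifier key is held (hypothetical type)."""

    modifier: str
    start: Point
    end: Point

    def to_events(self) -> List[Tuple[str, object]]:
        # Order matters: the key goes down before the mouse press and is
        # released only after the mouse button is released.
        return [
            ("key_down", self.modifier),
            ("mouse_down", self.start),
            ("mouse_move", self.end),
            ("mouse_up", self.end),
            ("key_up", self.modifier),
        ]


class RepeatGuard:
    """Flags a likely loop: the same action repeated with an unchanged screen."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        # Keep only the most recent (action, screen_hash) pairs.
        self._recent: Deque[Tuple[Hashable, str]] = deque(maxlen=max_repeats)

    def is_stuck(self, action: Hashable, screen_hash: str) -> bool:
        # screen_hash is assumed to be any fingerprint of the current screenshot.
        self._recent.append((action, screen_hash))
        window_full = len(self._recent) == self.max_repeats
        all_identical = len(set(self._recent)) == 1
        return window_full and all_identical
```

In this sketch, a harness would call `is_stuck` after every step and, when it returns `True`, prompt the agent to try a different action instead of clicking the same coordinate again.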

By releasing CutVerse as an open-source virtual testing ground, the researchers hope to shift the focus of AI development from static content generation to active, dynamic tool manipulation—bringing us one step closer to a future where AI can truly edit our stories.

AI Papers Reader
