The Conversational AI Illusion: Why Real-Time Assistants Still Fail the "Flow" Test

Imagine cooking a breakfast burrito while chatting with a smart assistant. You ask it to guide you step-by-step. As you crack an egg, a piece of eggshell slips into the bowl. A human companion would instantly interject: “Wait, stop! You dropped a shell.” But today’s state-of-the-art artificial intelligence models would likely either list the entire recipe at once, sit in silence, or inexplicably suggest a guide for a dishwasher instead.

This conversational disconnect is the focus of a new research paper introducing OmniInteract, a pioneering benchmark designed to test “omnimodal” AI assistants in true, real-time scenarios. Developed by an international research team, including scientists from CUHK MMLab, SJTU, and NTU, OmniInteract exposes a stark reality: while today’s AI models excel at analyzing videos and text in hindsight, they are remarkably clumsy when forced to interact in a live, unpredictable stream.

Historically, AI assistants have been evaluated using “offline” methods. Models are given a completed video or a text transcript and asked questions about what already happened. OmniInteract changes the rules. It feeds AI models continuous, live audio-visual streams. The models must listen to spoken queries embedded in background noise, watch visual events unfold, and decide for themselves if and when they should speak.

To build an intuition for how difficult this is, consider a “nested” interaction scenario. A user tells the AI, “Let me know when the kettle on the stove starts boiling.” While waiting, the user holds up a book and asks, “What’s the title of this book?” A competent assistant must pause its kettle-monitoring task, answer the book question immediately (“It’s The Stranger”), and then seamlessly resume watching the kettle. In evaluations on OmniInteract, top models like Google’s Gemini 2.5 Flash Live and Qwen3.5-Omni often answered the book query but completely forgot to go back to watching the kettle.
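The pause-and-resume behavior described above can be made concrete with a toy sketch. It is purely illustrative and is not how OmniInteract or any of the evaluated models work: the `Assistant` class, the event kinds (`watch_request`, `question`, `observation`), and the canned replies are all hypothetical names chosen for this example. The point is only that a standing request has to be stored somewhere that survives an interruption.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Assistant:
    """Toy assistant (hypothetical): standing watch requests survive interruptions."""

    # Maps a watched event name to the message to speak when it occurs.
    watches: Dict[str, str] = field(default_factory=dict)

    def handle(self, kind: str, payload: str) -> Optional[str]:
        """Return what to say for one stream event, or None to stay silent."""
        if kind == "watch_request":
            # e.g. "Let me know when the kettle starts boiling."
            self.watches[payload] = f"Heads up: {payload}."
            return "Okay, I'll keep watching."
        if kind == "question":
            # Answer right away. The watch list is left untouched,
            # so monitoring simply continues on the next event.
            return f"(answer to: {payload})"
        if kind == "observation":
            # Speak only if this is an event we were asked to watch for.
            return self.watches.pop(payload, None)
        return None


# A simplified event stream for the kettle-and-book scenario.
stream = [
    ("watch_request", "kettle boiling"),
    ("observation", "user stirs eggs"),
    ("question", "What's the title of this book?"),
    ("observation", "kettle boiling"),
]

assistant = Assistant()
replies = [assistant.handle(kind, payload) for kind, payload in stream]
```

In this sketch the assistant stays silent on the irrelevant observation, answers the book question, and still announces the kettle afterward. The failure the article describes corresponds to losing the `watches` entry once the interrupting question has been answered.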

The benchmark also tests continuous task monitoring (called “1QnA”). In the breakfast burrito cooking scenario, the AI must deliver small, timely pieces of advice only when the cook reaches specific stages of preparation. The results were humbling: the best-performing AI achieved an interaction score of just 0.052 out of 1.0 on these continuous tasks, showing that long-term tracking of human activities remains an open challenge.

Furthermore, the researchers discovered a “reasoning tax” associated with real-time processing. When the open-source model MiniCPM-o 4.5 was given a math problem to solve offline, it scored 0.68. But when it had to solve the same problem while actively listening to a live audio feed and managing the conversation, its score fell to just 0.34, half of its offline result.

Ultimately, OmniInteract reveals that building a truly helpful digital companion—one that knows when to speak, when to listen, and how to handle an interruption—is about more than just raw intelligence. It requires mastering the delicate, real-time dance of human flow.
