AI Papers Reader

Personalized digests of latest AI research


Can You Hear Me Now? New Dataset Reveals the Hidden Struggles of Voice AI

When we interact with the latest “Speech Large Language Models” (SLLMs), we expect them to function like the versatile assistants seen in science fiction—systems that can listen to a recording and instantly summarize it, or hear a sentence and translate it into another language. However, a new study suggests that our current methods for testing these AIs are fundamentally flawed because we are grading “voice” models on their ability to read “text.”

Researchers from the Karlsruhe Institute of Technology and several other European institutions have released a new dataset called DoWhatISay (DOWIS). Their findings, recently published in a paper titled “Do What I Say: A Spoken Prompt Dataset for Instruction-Following,” reveal a stark reality: even the most advanced AI models struggle significantly when they are given spoken instructions rather than written ones.

The “Text-to-Voice” Gap

Most AI developers evaluate their models using text-based prompts. To test whether a model can transcribe a meeting, a developer types: “Please transcribe the following audio.” But in the real world, a user is more likely to tap a button and simply say, “Hey, can you write out what’s being said in this audio?”

The DOWIS researchers argue that evaluating a voice AI using only text prompts is like testing a chef’s ability to cook by asking them to read a recipe. It doesn’t capture the messiness of real-world interaction. To solve this, they created a massive library of human-recorded prompts across 11 languages—including German, Italian, Russian, and Czech—covering nine different tasks like speech translation and audio chaptering.

Testing the Limits of “Style”

The DOWIS dataset is unique because it provides 10 different ways to ask for the same task, categorized into five styles: basic, detailed, short, formal, and informal.

To build an intuition for why this matters, imagine you want an AI to summarize a recording. Under the DOWIS framework, the model is tested on its response to:

  • Formal Spoken Prompt: “I would appreciate it if you could provide a summary of this audio.”
  • Informal Spoken Prompt: “Hey, give me the gist of what they’re saying here.”
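The structure described above (nine tasks, eleven languages, ten phrasings per task split across five styles) can be sketched as a small data model. This is an illustrative sketch only; the field names and the two-variants-per-style layout are assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of a DOWIS-style prompt entry.
# Names and structure are illustrative, not the dataset's real schema.
from dataclasses import dataclass
from itertools import product

STYLES = ["basic", "detailed", "short", "formal", "informal"]

@dataclass
class PromptVariant:
    task: str        # e.g. "summarization", "speech_translation"
    language: str    # e.g. "de", "it", "ru", "cs"
    style: str       # one of STYLES
    text: str        # the written form of the instruction
    audio_path: str  # path to the human-recorded spoken version

def variant_grid(task: str, language: str):
    """Two phrasings per style yields the ten variants per task."""
    return [(task, language, style, idx) for style, idx in product(STYLES, range(2))]

print(len(variant_grid("summarization", "de")))  # 10
```

Pairing each written instruction with a human recording of the same words is what lets the benchmark compare text and spoken prompting on otherwise identical inputs.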

The study tested two state-of-the-art models: Microsoft’s Phi-4 Multimodal and Alibaba’s Qwen2.5-Omni. The results were telling. For tasks that result in text output (like transcription), the models performed significantly worse when the instructions were spoken. In fact, for certain languages, the models’ performance dropped so sharply under spoken instructions that they became nearly unusable, even though they excelled when the exact same instructions were presented as text.
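One simple way to quantify the gap the study describes is the relative drop from a model's text-prompt score to its spoken-prompt score on the same task. The function and the numbers below are purely illustrative, not figures from the paper.

```python
# Illustrative sketch: measuring the text-vs-spoken instruction gap.

def relative_drop(text_score: float, spoken_score: float) -> float:
    """Percentage of text-prompt performance lost when the same
    instruction is spoken. Assumes higher scores are better
    (e.g. BLEU or accuracy)."""
    if text_score == 0:
        return 0.0
    return 100.0 * (text_score - spoken_score) / text_score

# Hypothetical scores for one task and language, for illustration only.
print(relative_drop(80.0, 52.0))  # 35.0 -> the model loses 35% of its quality
```

A per-language table of such drops would surface exactly the failure mode described above: languages where spoken instructions make an otherwise capable model nearly unusable.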

An “Overly Optimistic” Picture

The researchers found that informal and short prompts were the hardest for AIs to follow. Models generally prefer the “structured” nature of formal language, but humans rarely speak that way in casual settings.

Interestingly, there was one exception: tasks that require the AI to speak back (like text-to-speech synthesis). In these cases, the models handled spoken instructions much better, sometimes even outperforming text prompts.

The takeaway for the tech industry is clear: our current benchmarks provide an “overly optimistic” picture of AI capability. By relying on text-based evaluations, we are building systems that look great on paper but may stumble the moment a user opens their mouth to speak. The DOWIS dataset provides a new, more rigorous yardstick to ensure that the “voice” of the future actually understands the voice of its users.