AI Papers Reader

Personalized digests of the latest AI research


Beyond Intelligibility: New Benchmark Decodes the "Foreign Accent" in AI-Generated Indic Speech

Artificial intelligence has become remarkably good at speaking. If you ask a modern text-to-speech (TTS) system to read a sentence in Hindi or Tamil, it will likely produce a clear, understandable voice with a word error rate near zero. Yet, to a native speaker, something often feels “off.” The machine might be intelligible, but it sounds like a foreigner who has mastered the grammar but failed to grasp the soul of the local accent.

A new paper by researcher Venkata Pushpak Teja Menta introduces PSP (Phoneme Substitution Profile), an open-source benchmark designed to move beyond simple clarity. Instead of just asking “Can we understand this AI?”, PSP asks “Does this AI sound native?”

The Anatomy of an Accent

The core insight of the PSP paper is that an accent is not a single “vibe,” but a measurable collection of specific linguistic habits. In Indic languages, these habits often involve subtle mouth movements that non-native speakers—and global AI models—frequently ignore.

One of the most concrete examples highlighted in the paper is “retroflex collapse.” In languages like Telugu and Hindi, there is a vital distinction between “dental” sounds (where the tongue touches the teeth) and “retroflex” sounds (where the tongue curls back to touch the roof of the mouth). To an AI trained primarily on Western data, a retroflex “ṭ” might sound close enough to a standard “t.” But to a native speaker, collapsing these sounds is a dead giveaway of a non-native accent.

The PSP benchmark measures this “collapse rate” alongside five other dimensions, including aspiration fidelity (the “breathiness” in sounds like kh), vowel length, and the distinctive Tamil zha (the retroflex ‘ḻ’ sound).
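To make the idea of a substitution-based dimension concrete, here is a toy sketch of how a “retroflex collapse rate” could be computed from aligned phoneme pairs. The labels, the alignment, and the function are hypothetical illustrations, not the paper’s actual implementation:

```python
# Toy sketch: estimating a "retroflex collapse rate" from aligned phoneme
# pairs (reference phoneme vs. what the TTS actually produced).
# The phoneme inventory and alignment here are illustrative assumptions,
# not the PSP paper's implementation.

RETROFLEX = {"ʈ", "ɖ", "ɳ", "ɭ"}  # retroflex consonants (IPA)
DENTAL_OF = {"ʈ": "t̪", "ɖ": "d̪", "ɳ": "n̪", "ɭ": "l̪"}  # dental counterparts

def collapse_rate(aligned_pairs):
    """Fraction of retroflex targets realized as their dental counterpart."""
    targets = [(ref, hyp) for ref, hyp in aligned_pairs if ref in RETROFLEX]
    if not targets:
        return 0.0
    collapsed = sum(1 for ref, hyp in targets if hyp == DENTAL_OF[ref])
    return collapsed / len(targets)

# Example: 4 retroflex targets, 2 collapsed to dentals -> rate of 0.5
pairs = [("ʈ", "t̪"), ("ɖ", "ɖ"), ("ɳ", "n̪"), ("ɭ", "ɭ"), ("a", "a")]
print(collapse_rate(pairs))  # 0.5
```

A per-dimension rate like this is what lets the benchmark report a profile rather than a single score.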

The “Flatness” Problem

The researchers tested several heavyweights in the AI field, including ElevenLabs and Cartesia, as well as Indic-specific models like Sarvam’s Bulbul. The results revealed a fascinating “intelligibility-accent gap.”

For instance, a model might score perfectly on Word Error Rate (meaning it says all the right words) but fail miserably on Prosodic Signature Divergence (PSD). PSD measures the rhythm and “music” of speech—the pitch range and timing. The researchers found that some commercial systems produced speech with a pitch range 40% narrower than a native speaker’s. To a listener, this manifests as a “flat, non-expressive” delivery—the words are correct, but the cadence is robotic.
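A pitch-range comparison of this kind can be sketched in a few lines. The percentile-based definition of “range” below is an assumption for illustration, and the F0 values are made up; PSD as described in the paper also covers timing, not just pitch:

```python
# Toy sketch: comparing the pitch range of synthetic vs. native speech.
# The 5th-95th percentile definition of "range" and the F0 values are
# illustrative assumptions, not the paper's exact PSD formula.

def pitch_range(f0_values):
    """Spread of voiced F0 values in Hz (5th to 95th percentile)."""
    voiced = sorted(f0 for f0 in f0_values if f0 > 0)  # 0 = unvoiced frame
    lo = voiced[int(0.05 * (len(voiced) - 1))]
    hi = voiced[int(0.95 * (len(voiced) - 1))]
    return hi - lo

native = [120, 140, 150, 180, 200, 220, 240, 260, 0]   # wide, expressive
synth = [150, 160, 170, 180, 0, 190, 200, 222, 230]    # compressed, "flat"

narrowing = 1 - pitch_range(synth) / pitch_range(native)
print(f"pitch range {narrowing:.0%} narrower than native")
# prints "pitch range 40% narrower than native"
```

With the toy numbers above, the synthetic voice spans 72 Hz against the native speaker’s 120 Hz, matching the roughly 40% narrowing the researchers observed in some commercial systems.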

Hindi is “Mature,” Tamil is the Challenge

The benchmark reveals a clear hierarchy of difficulty. Hindi TTS has largely “solved” basic pronunciation; most systems tested showed near-native accuracy on these core sounds.

However, as the AI moves south, the challenge grows. In Telugu, retroflex collapse rates jumped to 40%, and in Tamil—described as the most “severe” target—those rates climbed as high as 70%. The paper notes that no single system is currently perfect across every dimension, suggesting that “native-ness” is the next great frontier for the “next billion users” entering the digital world.

By releasing PSP as an open-source tool, the researchers hope to give developers a “diagnostic loop.” Instead of just getting a single score, an engineer can now see exactly where their model is failing—whether it’s losing the rhythm of the sentence or literally failing to curl its digital tongue.