Large Language Models Bring New Hope to Grapheme-to-Phoneme Conversion, Especially for Underrepresented Languages

Speech recognition and synthesis have benefited greatly from advances in Grapheme-to-Phoneme (G2P) conversion, the process of translating written text into its corresponding pronunciation. However, building accurate G2P systems is especially challenging for languages with polyphone words (words with multiple pronunciations) and context-sensitive phonemes: because a word's pronunciation can change with its surrounding context, traditional G2P systems often struggle to pick the correct reading without sentence-level information.

Consider the Persian word for "flower," which is spelled the same as the word for "mud": both are written as "گل." However, "flower" is pronounced /gol/ while "mud" is pronounced /gel/. Determining the correct pronunciation relies on understanding the surrounding context, such as the sentence structure, to discern whether the word refers to "flower" or "mud." This challenge is further complicated by the presence or absence of the Ezafe, a short vowel /e/ used to connect related words in a Persian phrase. Whether or not the Ezafe is present can change the meaning of an entire sentence, making it a subtle but critical factor in determining the correct pronunciation.
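To make the limitation concrete, here is a minimal Python sketch (with a made-up, illustrative dictionary) of why a context-free, word-level lookup cannot resolve a polyphone like "گل": the written form maps to two valid phoneme sequences, and nothing at the word level says which one the sentence intends.

```python
# Illustrative only: a tiny word-level pronunciation dictionary.
# The polyphone "گل" has two valid phoneme sequences.
LEXICON = {
    "گل": ["g o l", "g e l"],  # /gol/ "flower" vs. /gel/ "mud"
}

def naive_g2p(word: str) -> str:
    """Context-free lookup: always returns the first listed pronunciation."""
    pronunciations = LEXICON.get(word, ["<unk>"])
    return pronunciations[0]

# The same written word can appear in a sentence about flowers or in one
# about mud, but a word-level lookup returns the same phonemes either way.
print(naive_g2p("گل"))  # "g o l" -- even when the sentence actually means "mud"
```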

This is where large language models (LLMs) come in. These advanced AI systems have demonstrated remarkable abilities in understanding and processing language, including phonetics. A recent paper by researchers at Sharif University of Technology in Iran, titled “LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study,” explores the potential of LLMs for G2P conversion, particularly in the context of the Persian language.

The researchers acknowledged that while LLMs show promise for G2P, their performance can be hindered by errors in their outputs and by their limitations in underrepresented languages. To overcome these challenges, they proposed a combination of innovative prompting techniques and post-processing methods designed to improve LLM outputs without requiring additional training or labeled data.
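This summary does not reproduce the authors' exact prompts, so the following is only a hedged sketch of what such a pipeline could look like: prompt an LLM for a per-word phoneme transcription of the whole sentence, then use a pronunciation dictionary to snap any output that is not a valid reading of its word back to the closest valid entry. The prompt wording, the `query_llm` callable, and the output format are assumptions for illustration, not the paper's implementation.

```python
import difflib
from typing import Callable

def llm_g2p(sentence: str,
            lexicon: dict[str, list[str]],
            query_llm: Callable[[str], str]) -> list[str]:
    """Prompt an LLM for phonemes, then apply dictionary-based correction.

    `query_llm` stands in for any function that sends a prompt to an LLM
    and returns its text response (e.g. a thin wrapper around an API client).
    """
    prompt = (
        "Transcribe the following Persian sentence into phonemes. "
        "Output one space-separated phoneme sequence per word, "
        "with words separated by ' | ':\n" + sentence
    )
    predicted = [p.strip() for p in query_llm(prompt).split("|")]

    # Post-processing without extra training or labeled data: if the model's
    # output for a word is not a valid pronunciation of that word, replace it
    # with the closest entry from the pronunciation dictionary.
    corrected = []
    for word, pred in zip(sentence.split(), predicted):
        candidates = lexicon.get(word)
        if candidates and pred not in candidates:
            # Snap to the most similar valid pronunciation for this word.
            pred = max(candidates,
                       key=lambda c: difflib.SequenceMatcher(None, c, pred).ratio())
        corrected.append(pred)
    return corrected
```

The point of the correction step is that it relies only on an existing pronunciation dictionary, not on any additional model training or labeled sentences.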

They introduced a new benchmark dataset called "Sentence-Bench," designed specifically for evaluating G2P models on sentence-level challenges in Persian. The dataset contains sentences annotated with their phoneme sequences, including examples of polyphone words and instances of the Ezafe phoneme.
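For a rough sense of what a sentence-level entry in such a benchmark might carry, the snippet below sketches one hypothetical example; the field names, phoneme notation, and the short phrase used are illustrative assumptions, not the actual Sentence-Bench schema.

```python
# Hypothetical record structure; field names and notation are illustrative only.
example = {
    "text": "گل قرمز",                     # "red flower" (a short phrase for brevity)
    "phonemes": "g o l e | gh e r m e z",  # the Ezafe /e/ attaches to the first word
    "polyphones": {"گل": "g o l"},         # the "flower" reading, not /gel/ "mud"
    "ezafe_after": [0],                    # index of the word followed by the Ezafe
}
```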

The researchers found that their combined approach, which pairs a specific prompting strategy with a dictionary-based correction method, yielded impressive results. It achieved significant reductions in phoneme error rate (PER), the standard metric for evaluating G2P models.
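As a reference point, PER is conventionally computed as the edit (Levenshtein) distance between the predicted and reference phoneme sequences, normalized by the reference length; the function below shows that standard calculation, with a made-up usage example.

```python
def phoneme_error_rate(predicted: list[str], reference: list[str]) -> float:
    """PER = edit distance between phoneme sequences / reference length."""
    # Classic dynamic-programming Levenshtein distance over phoneme tokens.
    m, n = len(predicted), len(reference)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n] / max(n, 1)

# Made-up example: one substitution (/e/ vs. /o/) out of three reference phonemes.
print(phoneme_error_rate("g e l".split(), "g o l".split()))  # 0.333...
```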

The approach also markedly improved the accuracy of predicting the pronunciation of polyphone words and of detecting the presence or absence of the Ezafe. These findings suggest that LLMs can outperform traditional G2P systems, especially in underrepresented languages like Persian, where training resources are limited.

The study highlighted the potential of LLMs to revolutionize G2P conversion, showing that they can be used to build high-quality G2P tools without relying on extensive labeled data. This opens up new possibilities for developing accurate and efficient G2P systems for underrepresented languages, ultimately making speech recognition and synthesis more accessible to a wider range of users.