This paper investigates the use of large language models (LLMs) for multilingual phoneme-to-grapheme (P2G) conversion in speech recognition, addressing the challenges of language-aware generation and cross-language data imbalance. The authors introduce a simplified version of Stochastic k-best marginalization (S-SKM) as a Monte Carlo approximation to handle speech-to-phoneme (S2P) uncertainty. Experiments on the CV-Lang10 benchmark show that robust training with S-SKM and low-resource oversampling reduces the average word error rate (WER) from 10.56% to 7.66%.
LLMs can achieve state-of-the-art multilingual speech recognition by robustly handling noisy phoneme inputs, even with severe data imbalance across languages.
Phoneme-based ASR factorizes recognition into speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G), enabling cross-lingual acoustic sharing while keeping language-specific orthography in a separate module. While large language models (LLMs) are promising for P2G, multilingual P2G remains challenging due to the need for language-aware generation and severe cross-language data imbalance. We study multilingual LLM-based P2G on the ten-language CV-Lang10 benchmark. We examine robustness strategies that account for S2P uncertainty, including DANP and Simplified SKM (S-SKM). S-SKM is a Monte Carlo approximation that avoids CTC-based S2P probability weighting in P2G training. Robust training and low-resource oversampling reduce the average WER from 10.56% to 7.66%.
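To make the S-SKM idea concrete, here is a minimal sketch of one plausible reading of the abstract: full SKM weights the P2G loss of each k-best S2P hypothesis by its renormalized S2P (e.g. CTC) probability, while Simplified SKM instead draws a few hypotheses (in proportion to those probabilities) and averages their losses uniformly, so no explicit probability weights appear in the training objective. The function names `skm_loss`, `s_skm_loss`, and the toy `p2g_nll` callback are hypothetical illustrations, not the paper's actual implementation.

```python
import random

def skm_loss(kbest, target, p2g_nll):
    """Full SKM (sketch): marginalize the P2G loss over k-best S2P
    hypotheses, weighted by their renormalized S2P probabilities.

    kbest: list of (phoneme_hypothesis, s2p_probability) pairs.
    p2g_nll: callable giving the P2G negative log-likelihood (or any
    per-hypothesis loss) of `target` given a phoneme hypothesis.
    """
    z = sum(p for _, p in kbest)  # renormalize over the k-best list
    return sum((p / z) * p2g_nll(hyp, target) for hyp, p in kbest)

def s_skm_loss(kbest, target, p2g_nll, n_samples=2, rng=random):
    """Simplified SKM (sketch): a Monte Carlo approximation. Sample
    hypotheses in proportion to their S2P probability, then average
    the P2G losses uniformly -- the probability weighting is implicit
    in the sampling, not explicit in the objective."""
    hyps = [h for h, _ in kbest]
    probs = [p for _, p in kbest]
    samples = rng.choices(hyps, weights=probs, k=n_samples)
    return sum(p2g_nll(h, target) for h in samples) / n_samples
```

In expectation the sampled average matches the weighted sum, which is what makes it a Monte Carlo approximation; in practice a small `n_samples` per utterance keeps training cost close to single-hypothesis P2G training.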