This paper introduces a few-shot accent synthesis pipeline for improving ASR on low-resource accented speech. The pipeline adapts a TTS decoder to a target accent using fewer than ten reference utterances and uses LLM-based phoneme editing to generate accent-conditioned pronunciations. Fine-tuning a self-supervised ASR model on the synthesized speech yields consistent WER reductions on real accented speech, even in ultra-low data regimes.
LLMs can guide phoneme editing to create synthetic accented speech from just a handful of examples, substantially improving ASR accuracy where training data is scarce.
Accented automatic speech recognition (ASR) often degrades due to the limited availability of accented training data. Prior work has explored accent modeling in low-resource settings, but existing approaches typically require minutes to hours of labeled speech, which may still be impractical for truly scarce accent scenarios. We propose a pipeline that adapts a text-to-speech (TTS) decoder to a target-accent speaker using fewer than ten reference utterances and employs large language model (LLM)-based phoneme editing to generate accent-conditioned pronunciations. The resulting synthetic speech is used to fine-tune a self-supervised ASR model. Experiments demonstrate consistent word error rate (WER) reductions on real accented speech, including cross-speaker evaluation and ultra-low data regimes. A matched-rate random phoneme baseline shows that phoneme-space perturbation itself is a strong form of augmentation, while LLM-guided edits provide additional gains through accent-conditioned structure.
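To make the matched-rate random phoneme baseline concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes edits are substitutions only, that "matched rate" means the random baseline changes the same fraction of phonemes as the LLM-guided edit, and that phonemes come from a small toy inventory. The names `PHONEMES`, `edit_rate`, and `random_matched_edit`, as well as the example sequences, are hypothetical.

```python
import random

# Hypothetical toy inventory (ARPAbet-style symbols); the paper's actual
# phoneme set is not specified in the abstract.
PHONEMES = ["AA", "AE", "AH", "IY", "UW", "B", "D", "K", "S", "T", "Z"]


def edit_rate(original, edited):
    """Fraction of positions whose phoneme changed (substitutions only)."""
    if len(original) != len(edited):
        raise ValueError("this sketch only handles equal-length sequences")
    if not original:
        return 0.0
    changed = sum(a != b for a, b in zip(original, edited))
    return changed / len(original)


def random_matched_edit(original, rate, rng):
    """Substitute round(rate * len) random positions with a different phoneme."""
    n_edits = round(rate * len(original))
    positions = rng.sample(range(len(original)), n_edits)
    out = list(original)
    for i in positions:
        # Exclude the current phoneme so every chosen position really changes.
        choices = [p for p in PHONEMES if p != original[i]]
        out[i] = rng.choice(choices)
    return out


# Assumed example: a canonical pronunciation and an LLM-edited variant.
original = ["K", "AE", "T", "S", "AH", "B", "IY", "D"]
llm_edited = ["K", "AA", "D", "S", "AH", "B", "IY", "D"]

rate = edit_rate(original, llm_edited)
rng = random.Random(0)  # fixed seed for reproducibility
baseline = random_matched_edit(original, rate, rng)

print("LLM edit rate:     ", rate)
print("Baseline edit rate:", edit_rate(original, baseline))
print("Baseline sequence: ", baseline)
```

Because both variants perturb the same fraction of phonemes, any difference in downstream WER can be attributed to where and how the edits are made rather than to how many there are, which is the comparison the abstract describes.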