The paper introduces CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model based on a Conformer-CTC architecture, designed to bridge large language models and text-to-speech systems. The model processes grapheme tokens chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels, and it guarantees a minimum look-ahead for every input token so that predictions stay stable. Because the CTC decoder learns the grapheme-phoneme alignment during training, the model does not depend on explicit word boundaries and can be applied to unsegmented languages. In experiments on a Japanese dataset, CC-G2PnP predicts PnP labels significantly more accurately than the baseline streaming G2PnP model.
Streaming grapheme-to-phoneme conversion for unsegmented languages is now possible thanks to a Conformer-CTC architecture that learns grapheme-phoneme alignments without needing explicit word boundaries.
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects large language models and text-to-speech systems in a streaming manner. CC-G2PnP is based on a Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimum look-ahead size for each input token, the proposed model can take future context into account for every token, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction.
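The two mechanisms named in the abstract, chunk-wise processing with a guaranteed look-ahead and CTC-based alignment, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `chunk_size` and `min_lookahead` parameters, the rule used to extend the visible context, and the `_` blank symbol are all assumptions chosen for illustration.

```python
def visible_right_edge(num_tokens, chunk_size, min_lookahead):
    """Return, for each token index, the exclusive right edge of its visible context.

    Assumed rule (hypothetical, not taken from the paper): a token sees up to
    the end of its own chunk, extended if needed so that at least
    `min_lookahead` future tokens are visible. The edge is capped at the
    sequence length, so the final tokens see fewer future tokens.
    """
    edges = []
    for i in range(num_tokens):
        chunk_end = (i // chunk_size + 1) * chunk_size  # end of the token's own chunk
        edge = max(chunk_end, i + 1 + min_lookahead)    # enforce the look-ahead floor
        edges.append(min(edge, num_tokens))             # cannot see past the input
    return edges


def ctc_greedy_collapse(frame_labels, blank="_"):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    collapsed = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


# Six tokens, chunks of three, at least two future tokens visible where available.
edges = visible_right_edge(num_tokens=6, chunk_size=3, min_lookahead=2)

# Frame-level outputs collapse to a label sequence without any word boundaries.
phonemes = ctc_greedy_collapse(["k", "k", "_", "o", "o", "_", "_", "n", "n"])
```

Without the look-ahead floor, the last token of every chunk would see no future context at all, which is the situation the guaranteed look-ahead is meant to avoid. The collapse step shows why CTC needs no segmentation: the blank symbol separates genuine repeats, so the output length is free to differ from the input length.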