This paper introduces a learning-free accented text-to-speech (TTS) framework that leverages phonological rules applied to phoneme sequences in conjunction with a multilingual TTS model. The method transforms accent at the phoneme level without requiring accented training data, enabling fine-grained control over accent while preserving intelligibility. Rule sets were designed for Spanish- and Indian-accented English, modeling phonological differences in consonants, vowels, and syllable structure.
Achieve accent-specific speech synthesis without any accented training data by combining phonological rules with a multilingual TTS model.
Accent plays a crucial role in speaker identity and inclusivity in speech technologies. Existing accented text-to-speech (TTS) systems either require large-scale accented datasets or lack fine-grained phoneme-level controllability. We propose an accented TTS framework that combines phonological rules with a multilingual TTS model. The rules are applied to phoneme sequences to transform accent at the phoneme level while preserving intelligibility. The method requires no accented training data and enables explicit phoneme-level accent manipulation. We design rule sets for Spanish- and Indian-accented English, modeling systematic differences in consonants, vowels, and syllable structure arising from phonotactic constraints. We analyze the trade-off between phoneme-level duration alignment and accent as realized in speech timing. Experimental results demonstrate effective accent shift while maintaining speech quality.
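As a rough illustration of what applying phonological rules to a phoneme sequence could look like, the sketch below rewrites a list of IPA phonemes before it would be handed to a TTS model. The rule table, the vowel set, and the function name `apply_accent_rules` are hypothetical and chosen only for illustration. They draw on commonly described features of Spanish-accented English (devoicing of /z/, tensing of /ɪ/, and an epenthetic /e/ before a word-initial /s/ + consonant cluster) and are not the rule sets designed in the paper.

```python
from typing import Dict, List

# Hypothetical segment-level substitutions (illustrative, not the paper's rules).
SUBSTITUTIONS: Dict[str, str] = {
    "z": "s",  # consonant rule: /z/ realized as /s/
    "ɪ": "i",  # vowel rule: lax /ɪ/ realized as tense /i/
}

# Small, assumed vowel inventory used only to detect consonant clusters.
VOWELS = {"a", "e", "i", "o", "u", "ɪ", "ɛ", "æ", "ʌ", "ə", "ʊ", "ɔ", "ɑ"}


def apply_accent_rules(phonemes: List[str]) -> List[str]:
    """Return a new phoneme list for one word with the toy rules applied."""
    out: List[str] = []
    # Syllable-structure rule: insert /e/ before a word-initial /s/ + consonant.
    if len(phonemes) >= 2 and phonemes[0] == "s" and phonemes[1] not in VOWELS:
        out.append("e")
    # Consonant and vowel rules: context-free substitution per phoneme.
    for p in phonemes:
        out.append(SUBSTITUTIONS.get(p, p))
    return out


if __name__ == "__main__":
    print(apply_accent_rules(["s", "k", "u", "l"]))  # "school"
    print(apply_accent_rules(["z", "ɪ", "p"]))       # "zip"
```

Because the transformation operates on phonemes rather than on model weights, each rule can be switched on or off independently, which is one way to picture the explicit phoneme-level control the abstract describes.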