This paper introduces MagpieTTS-LF, an inference-time method for generating long-form speech that mitigates prosodic drift and speaker inconsistency without retraining on long-form data. It combines soft attention priors, a stateful inference algorithm, and history-aware text encoding to maintain coherence and continuity across sentence chunks. Experiments on long texts show improvements in intelligibility, prosodic coherence, speaker consistency, and sentence-boundary naturalness over existing TTS baselines.
Coherent, natural long-form speech can be generated purely at inference time, without retraining the model on long-form data.
Neural Text-to-Speech (TTS) systems achieve remarkable quality on short utterances, but long-form speech generation suffers from prosodic drift, speaker inconsistencies, and sentence-boundary artifacts. Existing approaches either compress sequences, increase context length, or naively concatenate independently synthesized chunks. We present MagpieTTS-LF, an inference-time approach that enables MagpieTTS to produce coherent long-form speech without model retraining. Our method introduces three key innovations: (1) soft attention priors that guide monotonic alignment while preserving past and future context; (2) a stateful inference algorithm that maintains context across sentence chunks, ensuring prosodic continuity; and (3) history-aware text encoding that uses past text for discourse-level prosodic planning. Experiments on long texts show significant improvements in long-range intelligibility, prosodic coherence, speaker consistency, and boundary naturalness compared to baselines.
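The three components can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the form of the prior or the contents of the carried state, so the Gaussian-shaped prior, the `floor` and `width` parameters, the `history_size` window, and the function names `soft_attention_prior`, `apply_prior`, and `plan_chunks` are all assumptions made for illustration.

```python
import math


def soft_attention_prior(num_text_tokens, center, width=2.0, floor=0.05):
    """Prior over text positions, peaked near `center` (assumed Gaussian shape).

    `floor` must be > 0 so that no position is ever fully masked; this is
    what makes the prior "soft" and keeps past and future tokens reachable.
    """
    raw = []
    for i in range(num_text_tokens):
        bump = math.exp(-0.5 * ((i - center) / width) ** 2)
        raw.append(floor + (1.0 - floor) * bump)
    total = sum(raw)
    return [r / total for r in raw]


def apply_prior(logits, prior):
    """Add the log-prior to raw attention logits, then apply a softmax."""
    biased = [logit + math.log(p) for logit, p in zip(logits, prior)]
    peak = max(biased)  # subtract the max for numerical stability
    weights = [math.exp(b - peak) for b in biased]
    total = sum(weights)
    return [w / total for w in weights]


def plan_chunks(sentences, history_size=2):
    """Stateful chunk loop: each sentence is paired with preceding text.

    A real system would also carry acoustic or decoder state between
    chunks; here the state holds only text history (an assumption).
    """
    state = {"text_history": [], "chunks_done": 0}
    plans = []
    for sentence in sentences:
        if history_size > 0:
            history = state["text_history"][-history_size:]
        else:
            history = []
        plans.append({"history": list(history), "current": sentence})
        state["text_history"].append(sentence)
        state["chunks_done"] += 1
    return plans, state


if __name__ == "__main__":
    prior = soft_attention_prior(num_text_tokens=10, center=4)
    weights = apply_prior([0.0] * 10, prior)
    plans, state = plan_chunks(["First.", "Second.", "Third."])
    print(weights)
    print(plans)
```

With uniform logits, `apply_prior` returns the prior itself, so the prior alone determines where attention concentrates. As the logits become more peaked, they can override the prior, which is the difference between a soft bias and a hard monotonic mask.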