NVIDIA | Jun 16, 2026 | arXiv:2606.18485

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

Subhankar Ghosh, Jason Li, Paarth Neekhara, Shehzeen Hussain, Ryan Langman, Xuesong Yang, Roy Fejgin

AI Summary

This paper introduces MagpieTTS-LF, an inference-time method for long-form speech generation that reduces prosodic drift and speaker inconsistency without retraining on long-form data. It combines soft attention priors, a stateful inference algorithm, and history-aware text encoding to maintain coherence and continuity across sentence chunks. The reported experiments on long texts show significant improvements in long-range intelligibility, prosodic coherence, speaker consistency, and boundary naturalness over the compared baselines.

Key Contribution

MagpieTTS-LF enables MagpieTTS to produce coherent long-form speech purely at inference time, without retraining the model on long-form data.

Abstract

Neural Text-to-Speech (TTS) systems achieve remarkable quality on short utterances, but long-form speech generation shows prosodic drift, speaker inconsistencies, and sentence boundary artifacts. Existing approaches either compress sequences, increase context length, or naively concatenate independently synthesized chunks. We present an inference-time approach called MagpieTTS-LF that enables MagpieTTS to produce coherent long-form speech without model retraining. Our method introduces three key innovations: (1) soft attention priors to guide monotonic alignment while preserving past and future context; (2) a stateful inference algorithm that maintains context across sentence chunks, ensuring prosodic continuity; (3) history-aware text encoding that uses past text for discourse-level prosodic planning. Experiments on long texts show significant improvements in long-range intelligibility, prosodic coherence, speaker consistency, and boundary naturalness compared to baselines.
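To make the first of these ideas concrete, the sketch below illustrates what a "soft" attention prior could look like in general terms. It is not the authors' implementation: the abstract does not specify the prior's shape, so the Gaussian bump, the `width` and `floor` parameters, and the function names `soft_attention_prior` and `apply_prior` are all assumptions chosen for illustration. The point it shows is only that a soft prior biases attention toward an expected text position while leaving every past and future token reachable, unlike a hard mask.

```python
import numpy as np


def soft_attention_prior(num_text_tokens, center, width=2.0, floor=1e-2):
    """Hypothetical soft prior over text positions, peaked at `center`.

    The Gaussian shape and the floor are illustrative assumptions. The floor
    keeps every position strictly positive, so the prior nudges attention
    toward `center` without masking out past or future tokens.
    """
    positions = np.arange(num_text_tokens)
    bump = np.exp(-0.5 * ((positions - center) / width) ** 2)
    prior = bump + floor
    return prior / prior.sum()


def apply_prior(attn_logits, prior):
    """Add the log-prior to raw attention logits, then apply a softmax."""
    biased = attn_logits + np.log(prior)
    biased = biased - biased.max()  # subtract the max for numerical stability
    weights = np.exp(biased)
    return weights / weights.sum()


# Toy example: uniform logits over 10 text tokens, prior centred on token 4.
logits = np.zeros(10)
prior = soft_attention_prior(10, center=4)
weights = apply_prior(logits, prior)
```

With uniform logits the resulting weights simply equal the prior; with real logits the prior acts as an additive bias in log space. In a chunked, stateful setting one could imagine advancing `center` as decoding progresses and carrying it across sentence chunks, but how MagpieTTS-LF actually does this is not described in the abstract.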

Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
