ZJU · May 29, 2026 · arXiv:2605.30993

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Ruiqi Li, Yu Zhang, Changhao Pan, Ke Lei, Xiang Yin

AI Summary

The paper introduces SwanVoice, a zero-shot TTS model for expressive long-form speech synthesis in both monologue and dialogue settings with 1 to 4 speakers. It targets the acoustic-consistency, coherence, and speaker-switching problems of existing systems. Training data comes from SwanData-Speech, a monologue and dialogue corpus built from in-the-wild audio with pause-aware word-level alignment; the model is a flow-matching DiT with speaker-turn conditioning. On SwanBench-Speech, SwanVoice obtains higher richness and hierarchy scores than all evaluated open-source baselines in both settings, while content accuracy remains its main limitation.

Key Contribution

SwanVoice is a single zero-shot TTS model that covers both monologue and multi-speaker dialogue (1 to 4 speakers), aiming to keep expressive coherence, controllable speaker switching, and monologue quality at the same time.

Abstract

Zero-shot text-to-speech (TTS) has improved substantially for single-speaker synthesis, yet expressive long-form multi-speaker dialogue remains difficult. A common workaround is to synthesize each turn with a monologue TTS model and stitch the outputs together. This adds inference cost and often breaks acoustic consistency, conversational coherence, and affective continuity across turns. Recent dialogue TTS systems have begun to address this setting, but they still struggle to keep expressive coherence, controllable speaker switching, and monologue quality at the same time. We present SwanData-Speech and SwanVoice. SwanData-Speech builds monologue and dialogue corpora from in-the-wild audio, using Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for pronunciation-hard cases. Built on these data, SwanVoice is a zero-shot TTS model for 1 to 4 speakers, combining a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching DiT with speaker-turn conditioning. Training starts from monologue speech, moves through mixed and real dialogue data, and then uses DiffusionNFT post-training with phone-level and speaker-similarity rewards. On SwanBench-Speech, SwanVoice obtains higher richness and hierarchy scores than all evaluated open-source baselines in both monologue and dialogue settings, while content accuracy remains the main limitation. Audio demos are available at https://swanaigc.github.io//#swanvoice.
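To make the 25 Hz latent rate and the idea of speaker-turn conditioning concrete, here is a minimal sketch of how a frame-level speaker track could be built from turn boundaries. Only the 25 Hz rate and the 1 to 4 speaker range come from the abstract. The `Turn` type, the `speaker_track` helper, the padding id, and the choice of a per-frame integer track are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

LATENT_RATE_HZ = 25  # latent frame rate stated in the abstract


@dataclass
class Turn:
    """Hypothetical turn record: who speaks, and when (in seconds)."""
    speaker: int    # assumed 0-based index, 0..3 for up to 4 speakers
    start_s: float
    end_s: float


def num_latent_frames(duration_s: float, rate_hz: int = LATENT_RATE_HZ) -> int:
    """Number of latent frames covering `duration_s` seconds at `rate_hz`."""
    return round(duration_s * rate_hz)


def speaker_track(turns: list[Turn], total_s: float, pad_id: int = -1) -> list[int]:
    """Per-frame speaker ids; frames outside every turn get `pad_id` (assumed)."""
    n = num_latent_frames(total_s)
    track = [pad_id] * n
    for turn in turns:
        lo = max(0, round(turn.start_s * LATENT_RATE_HZ))
        hi = min(n, round(turn.end_s * LATENT_RATE_HZ))
        for i in range(lo, hi):
            track[i] = turn.speaker
    return track


if __name__ == "__main__":
    dialogue = [Turn(0, 0.0, 1.0), Turn(1, 1.2, 2.0)]
    track = speaker_track(dialogue, total_s=2.0)
    print(len(track), track[:3], track[-3:])
```

Under these assumptions, one minute of audio corresponds to 1500 latent frames, and a track like this could be fed to a model as a conditioning sequence aligned with the latents. How SwanVoice actually encodes speaker turns is not specified in the abstract.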

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Speech & Audio

Year: 2026

Venue: N/A
