Mar 13, 2026 | arXiv:2603.13518

VoXtream2: Full-stream TTS with dynamic speaking rate control

AI Summary

VoXtream2 is a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated on the fly, aimed at interactive systems. It combines a distribution matching mechanism over duration states with classifier-free guidance to improve controllability and synthesis quality. The model achieves competitive results on zero-shot benchmarks despite being smaller and trained on less data than public baselines, and in full-stream mode it runs 4 times faster than real time with 74 ms first-packet latency.

Key Contribution

Control speaking rate on the fly in your TTS system with VoXtream2, which in full-stream mode runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
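To make the speed figure concrete, here is a small arithmetic sketch of what "4 times faster than real time" implies. The 10-second clip length and the helper name `real_time_factor` are illustrative assumptions, not values or code from the paper; only the 4x speedup comes from the text above.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor (RTF): compute time divided by audio duration.

    An RTF below 1.0 means synthesis is faster than playback.
    """
    return synthesis_seconds / audio_seconds


# Assumed example clip length (hypothetical, chosen for easy arithmetic).
audio_seconds = 10.0
# Speedup reported in the text: 4 times faster than real time.
speedup = 4.0

# Time needed to synthesize the clip at that speedup.
synthesis_seconds = audio_seconds / speedup
rtf = real_time_factor(synthesis_seconds, audio_seconds)

print(f"synthesis time: {synthesis_seconds} s, RTF: {rtf}")
```

Note that throughput and first-packet latency are separate quantities: the RTF describes sustained generation speed, while the 74 ms figure describes how long a listener waits before the first audio arrives.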

Abstract

Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance on the fly. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
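The abstract does not give the guidance formula, so the following is only a minimal sketch of the standard classifier-free guidance rule, applied here to a toy set of logits. The three-way logits, the `scale` value, and the function names are hypothetical; how VoXtream2 actually applies guidance across its conditioning signals and duration states is not specified in this text.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Generic classifier-free guidance: uncond + scale * (cond - uncond).

    scale = 1.0 returns the conditional logits unchanged; larger values
    push the output further in the direction the conditioning favors.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over three hypothetical states (not from the paper).
cond = [2.0, 1.0, 0.0]    # prediction with the conditioning signal
uncond = [1.0, 1.0, 1.0]  # prediction with the conditioning dropped

guided = cfg_logits(cond, uncond, scale=2.0)
probs = softmax(guided)

print("guided logits:", guided)
print("guided probabilities:", [round(p, 3) for p in probs])
```

In this toy case the guided distribution places more mass on the state the conditional prediction already preferred, which is the general effect guidance is used for.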

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 65

Year: 2026

Venue: N/A
