Tongji | Jun 8, 2026 | arXiv:2606.09141

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

Hanke Xie, Xiaming Ren, Dake Guo, Ruonan You, Wenhao Li, Jingbin Hu, Guobin Ma, Huakang Chen, Kejie Xu, Rui Huang, Weiguo Tan, Xianrong Wang, Lei Xie, Lei Xi

AI Summary

FlashTTS is a low-latency streaming Text-to-Speech (TTS) framework for speech dialogue systems. Its lagged multi-track architecture processes streaming text and speech inputs natively, which removes the need for sentence-level buffering. To speed up acoustic generation, it combines parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder that performs token-to-mel generation in two function evaluations (2-NFE). The reported First-Packet Latency is 325ms, with strong zero-shot voice cloning and cross-lingual intelligibility preserved.

Key Contribution

FlashTTS reduces First-Packet Latency to 325ms for real-time speech dialogue systems while preserving zero-shot voice cloning and cross-lingual intelligibility.

Abstract

Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. These systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities. These systems typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source, low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, thereby eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder. This configuration achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS reduces First-Packet Latency to 325ms, substantially lower than robust streaming baselines, while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.
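To make the "two function evaluations (2-NFE)" claim concrete, the sketch below shows what a generic two-step mean-flow sampler can look like. It is not the FlashTTS decoder: the names `two_step_sample`, `make_x_pred_velocity`, and `toy_predict_x` are hypothetical, the linear noise-to-data path is an assumption, and deriving the average velocity from a predicted clean sample as `(z - x_hat) / t` is only one plausible reading of "X-pred". A toy predictor stands in for the trained network.

```python
from typing import Callable, List

Vector = List[float]


def two_step_sample(
    avg_velocity: Callable[[Vector, float, float], Vector],
    noise: Vector,
    t_mid: float = 0.5,
) -> Vector:
    """Move from t=1 (noise) to t=0 (data) with exactly two model calls.

    Each step applies the mean-flow update z_r = z_t - (t - r) * u(z_t, r, t),
    where u is the average velocity over the interval [r, t].
    """
    z = list(noise)
    for t, r in ((1.0, t_mid), (t_mid, 0.0)):
        u = avg_velocity(z, r, t)  # one function evaluation per step
        z = [zi - (t - r) * ui for zi, ui in zip(z, u)]
    return z


def make_x_pred_velocity(
    predict_x: Callable[[Vector, float], Vector],
) -> Callable[[Vector, float, float], Vector]:
    """Wrap a clean-sample predictor as an average-velocity function.

    Assumption: on the linear path z_t = (1 - t) * x + t * noise, the velocity
    is constant, so the average velocity equals (z_t - x) / t for any r < t.
    """
    def avg_velocity(z: Vector, r: float, t: float) -> Vector:
        x_hat = predict_x(z, t)
        return [(zi - xi) / t for zi, xi in zip(z, x_hat)]

    return avg_velocity


# Toy stand-in for a trained decoder: always predicts the same "mel frame".
TARGET: Vector = [0.25, -1.0, 2.0]


def toy_predict_x(z: Vector, t: float) -> Vector:
    return list(TARGET)


sample = two_step_sample(
    make_x_pred_velocity(toy_predict_x),
    noise=[1.5, 0.5, -0.75],
)
print(sample)
```

With a perfect predictor on a straight path, the two updates land on the predicted sample; a real decoder would replace `toy_predict_x` with a network conditioned on speech tokens and speaker information.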

Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
