Le Mans Université, Universiti Sains Malaysia

Apr 14, 2026

arXiv:2604.12438

An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding

Tian Su, Tianhui Su, Tien-Ping Tan, T. Tan, Salima Mdhaffar, Yannick Estève, Aghilas Sini

AI Summary

This paper introduces a non-autoregressive text-to-speech (TTS) architecture that directly models the discrete latent space of the Mimi audio codec to achieve ultra-low latency. The architecture employs a modified FastSpeech 2 backbone with a progressive depth-wise sequential decoding strategy to condition 32 layers of residual vector quantization codes. Experiments on English and Malay datasets demonstrate a 10.6x speedup over cascaded pipelines and an average time-to-first-byte latency of 48.99 ms, while improving voicing accuracy and reducing spectral degradation.

Key Contribution

Achieves real-time, high-fidelity speech synthesis with an average time-to-first-byte latency of 48.99 ms by directly generating compressed audio codec tokens, bypassing the neural vocoder bottleneck.

Abstract

Real-time speech synthesis requires balancing inference latency and acoustic fidelity for interactive applications. Conventional continuous text-to-speech pipelines require computationally intensive neural vocoders to reconstruct phase information, creating a significant streaming bottleneck. Furthermore, regression-based acoustic modeling frequently induces spectral over-smoothing artifacts. To address these limitations, this paper proposes a novel end-to-end non-autoregressive architecture optimized for ultra-low latency block-wise generation, directly modeling the highly compressed discrete latent space of the Mimi neural audio codec. Integrating a modified FastSpeech 2 backbone with a progressive depth-wise sequential decoding strategy, the architecture dynamically conditions 32 layers of residual vector quantization codes. This mechanism resolves phonetic alignment degradation and manages the complexity of high-fidelity discrete representations without temporal autoregressive overhead. Experimental evaluations on English and Malay datasets validate its language-independent deployment capability. Compared to conventional continuous regression models, the proposed architecture demonstrates quantitative improvements in fundamental voicing accuracy and mitigates high-frequency spectral degradation. It achieves ultra-low latency inference, translating to a 10.6-fold absolute acceleration over conventional cascaded pipelines. Crucially, the system achieves an average time-to-first-byte latency of 48.99 milliseconds, falling significantly below the human perception threshold for real-time interactive streaming. These results firmly establish the proposed architecture as a highly optimized solution for deploying real-time streaming speech interfaces.
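One way to read the "progressive depth-wise sequential decoding" described above is that frames within a block are produced without any dependence across time, while the 32 codebook layers inside each frame are predicted one after another, each conditioned on the codes already chosen for shallower layers. The following toy sketch illustrates that reading only. It is not the authors' implementation: the weights are random, the codebook size and hidden dimension are small illustrative values, and the names `decode_frame` and `decode_block` are hypothetical.

```python
import random

NUM_LAYERS = 32     # RVQ depth stated in the abstract
CODEBOOK_SIZE = 16  # toy value (assumption), chosen small for illustration
HIDDEN_DIM = 8      # toy value (assumption)


def make_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def decode_frame(hidden, embeddings, heads):
    """Choose one code per RVQ layer, shallowest layer first.

    Each layer sees the backbone hidden state plus the embeddings of the
    codes already chosen for the layers before it.
    """
    context = list(hidden)
    codes = []
    for layer in range(NUM_LAYERS):
        # Score every codebook entry against the current context.
        logits = [sum(w * c for w, c in zip(row, context)) for row in heads[layer]]
        code = max(range(CODEBOOK_SIZE), key=logits.__getitem__)
        codes.append(code)
        # Fold the chosen code back into the context for the next layer.
        chosen = embeddings[layer][code]
        context = [c + e for c, e in zip(context, chosen)]
    return codes


def decode_block(block_hidden, embeddings, heads):
    """Decode a block of frames; frames do not depend on one another."""
    return [decode_frame(h, embeddings, heads) for h in block_hidden]


rng = random.Random(0)
embeddings = [make_matrix(CODEBOOK_SIZE, HIDDEN_DIM, rng) for _ in range(NUM_LAYERS)]
heads = [make_matrix(CODEBOOK_SIZE, HIDDEN_DIM, rng) for _ in range(NUM_LAYERS)]
block_hidden = make_matrix(4, HIDDEN_DIM, rng)  # 4 frames of backbone output
block_codes = decode_block(block_hidden, embeddings, heads)
```

In this sketch the only sequential loop runs over codebook depth, so its cost is fixed per frame, and a block of frames could be decoded in parallel because no frame reads another frame's codes.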

Architecture Design (Transformers, SSMs, MoE)

Inference & Quantization

Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 46

Year: 2026

Venue: N/A
