Apr 21, 2026

arXiv:2604.19330

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

AI Summary

This paper introduces Chain-of-Details (CoD), a cascaded TTS architecture that progressively refines temporal details across multiple stages, each targeting a specific temporal granularity. A single decoder is shared across all temporal resolutions, which keeps the parameter count low, and the lowest detail level implicitly performs phonetic planning without an explicit phoneme duration predictor. Experiments on several datasets show that CoD achieves competitive performance with significantly fewer parameters than existing TTS approaches. The authors conclude that explicitly modeling temporal dynamics leads to more natural speech synthesis.

Key Contribution

Achieves competitive TTS performance with significantly fewer parameters by explicitly modeling temporal coarse-to-fine dynamics in a cascaded architecture whose lowest detail level implicitly handles phonetic planning.

Abstract

Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of-Details (CoD), a novel framework that explicitly models temporal coarse-to-fine dynamics in speech generation using a cascaded architecture. Our method progressively refines temporal details across multiple stages, with each stage targeting a specific temporal granularity. All temporal detail predictions are performed using a shared decoder, enabling efficient parameter utilization across different temporal resolutions. Notably, we observe that the lowest detail level naturally performs phonetic planning without the need for an explicit phoneme duration predictor. We evaluate our method on several datasets and compare it against several baselines. Experimental results show that CoD achieves competitive performance with significantly fewer parameters than existing approaches. Our findings demonstrate that explicit modeling of temporal dynamics with the CoD framework leads to more natural speech synthesis.
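The coarse-to-fine temporal cascade described in the abstract can be illustrated with a minimal control-flow sketch. It is not the authors' implementation: the abstract does not specify the temporal strides, how one stage conditions the next, or the decoder itself. The names `chain_of_details`, `upsample`, and `toy_decoder`, the strides `[4, 2, 1]`, the zero-initialized coarsest context, and upsampling by repetition are all assumptions made for illustration, and text conditioning is left out entirely.

```python
from typing import Callable, List

Sequence = List[float]


def upsample(seq: Sequence, factor: int) -> Sequence:
    """Repeat each time step `factor` times (nearest-neighbour upsampling)."""
    return [x for x in seq for _ in range(factor)]


def chain_of_details(
    num_frames: int,
    strides: List[int],
    decoder: Callable[[Sequence, int], Sequence],
) -> List[Sequence]:
    """Run the stages from coarsest to finest temporal resolution.

    `strides` lists hypothetical temporal strides, coarsest first.
    The same `decoder` callable is used at every stage; only the
    `level` index tells it which granularity it is working on.
    """
    outputs: List[Sequence] = []
    prev: Sequence = []
    prev_stride = 0
    for level, stride in enumerate(strides):
        length = num_frames // stride
        if level == 0:
            # Assumption: the coarsest stage starts from an empty context.
            context = [0.0] * length
        else:
            # Stretch the previous, coarser output to this stage's length.
            context = upsample(prev, prev_stride // stride)
        prev = decoder(context, level)  # shared decoder, reused per stage
        prev_stride = stride
        outputs.append(prev)
    return outputs


def toy_decoder(context: Sequence, level: int) -> Sequence:
    """Stand-in for a learned model: adds a smaller increment at finer levels."""
    return [c + 0.5 ** level for c in context]


if __name__ == "__main__":
    for stage in chain_of_details(8, [4, 2, 1], toy_decoder):
        print(len(stage), stage)
```

Each stage doubles the temporal resolution and refines the stretched output of the stage before it. In a real system the toy decoder would be replaced by a learned network conditioned on the input text.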

Architecture Design (Transformers, SSMs, MoE)

Natural Language Processing

Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
