CAS | Hello Group Inc. | Apr 9, 2026 | arXiv:2604.08363

CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

Xiaosu Su, Zihan Sun, Peilei Jia, Jun Gao

AI Summary

This paper introduces CapTalk, a caption-conditioned text-to-speech framework for unified voice design across single-utterance and dialogue settings. It uses utterance-level and speaker-level captions for voice design and speaker modeling, respectively, and incorporates a Chain-of-Thought (CoT) control sequence for turn-level attribute planning in dialogues. A hierarchical variational conditioning module balances timbre preservation and context-adaptive expression, achieving state-of-the-art performance in both single-utterance voice design and multi-turn dialogue scenarios.

Key Contribution

A voice design model that handles both single utterances and multi-turn dialogues, with improved expression controllability and contextual awareness.

Abstract

Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with a target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored. In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a Chain-of-Thought (CoT) control sequence in dialogue to explicitly plan turn-level dynamic attributes. To resolve the conflict between stable timbre preservation and context-adaptive expression, we propose a hierarchical variational conditioning module with an utterance-level speaker encoder. This enables timbre reuse while keeping expression adaptive to the current utterance and, in dialogue, the surrounding context. We also build a comprehensive evaluation protocol for both single-utterance and dialogue settings. Experiments show that CapTalk achieves state-of-the-art performance on a single-utterance voice design benchmark and delivers better expression controllability and contextual appropriateness in multi-turn dialogue. Audio samples are available at: https://anonymous.4open.science/api/repo/Captalk-D601/file/index.html.
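As a rough illustration of the two conditioning modes described in the abstract, the sketch below assembles a text prompt for each: an utterance-level caption for a single utterance, and speaker-level captions plus a per-turn plan for dialogue. Everything here is an assumption made for illustration: the tag names (`<caption>`, `<speaker>`, `<plan>`), the `Turn` class, the plan keys, and the idea of serializing the plan as `key=value` pairs are hypothetical and are not taken from the paper, which does not specify its sequence format in the abstract. In the actual model the turn-level plan would presumably be produced as part of generation rather than supplied by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    """One dialogue turn (hypothetical structure, not from the paper)."""
    speaker: str
    text: str
    # Hypothetical turn-level attributes, e.g. {"emotion": "calm", "pace": "slow"}
    plan: Dict[str, str] = field(default_factory=dict)


def build_single_utterance_prompt(utterance_caption: str, text: str) -> str:
    """Single-utterance mode: one caption describes timbre and style together."""
    return f"<caption>{utterance_caption}</caption><text>{text}</text>"


def build_dialogue_prompt(speaker_captions: Dict[str, str], turns: List[Turn]) -> str:
    """Dialogue mode: a speaker-level caption per speaker, then each turn
    preceded by its planned attributes (keys sorted for a stable layout)."""
    parts = [
        f"<speaker id={name}>{caption}</speaker>"
        for name, caption in speaker_captions.items()
    ]
    for turn in turns:
        plan = ", ".join(f"{k}={v}" for k, v in sorted(turn.plan.items()))
        parts.append(
            f"<turn speaker={turn.speaker}><plan>{plan}</plan>"
            f"<text>{turn.text}</text></turn>"
        )
    return "".join(parts)
```

The point of the split is that the speaker-level caption stays fixed across turns (stable timbre), while the per-turn plan can change with context (adaptive expression).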

Multimodal Models, Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 39

Year: 2026

Venue: N/A
