Cohere | Mar 17, 2026 | arXiv:2603.16280

CAST-TTS: A Simple Cross-Attention Framework for Unified Timbre Control in TTS

Zihao Zheng, Wen Wu, Chao Zhang, Mengyue Wu, Xuenan Xu

AI Summary

CAST-TTS introduces a unified TTS framework that uses cross-attention to control timbre from both speech and text prompts. The method employs pre-trained encoders for feature extraction and a multi-stage training strategy to align speech and text representations in a shared embedding space. Experiments demonstrate that this unified cross-attention mechanism achieves comparable performance to specialized single-input models, while simplifying the overall architecture.

Key Contribution

Ditch the separate models: CAST-TTS uses a single cross-attention mechanism to control TTS timbre from both speech and text, rivaling specialized models in quality.

Abstract

Current Text-to-Speech (TTS) systems typically use separate models for speech-prompted and text-prompted timbre control. While unifying both control signals into a single model is desirable, the challenge of cross-modal alignment often results in overly complex architectures and training objectives. To address this challenge, we propose CAST-TTS, a simple yet effective framework for unified timbre control. Features are extracted from speech prompts and text prompts using pre-trained encoders. A multi-stage training strategy efficiently aligns the speech and projected text representations within a shared embedding space. A single cross-attention mechanism then allows the model to use either of these representations to control the timbre. Extensive experiments validate that the unified cross-attention mechanism is critical for achieving high-quality synthesis. CAST-TTS achieves performance comparable to specialized single-input models while operating within a unified architecture. The demo page can be accessed at https://HiRookie9.github.io/CAST-TTS-Page.
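To make the idea of one cross-attention path serving both prompt types concrete, the following is a minimal sketch and not the authors' implementation. It assumes that speech-prompt features already live in the shared embedding space, that text-prompt features are mapped into it by a single linear projection, and that learned query/key/value weights can be omitted for brevity. All names (`timbre_condition`, `text_projection`, the toy dimensions and values) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, prompt):
    """Scaled dot-product cross-attention: each query row attends over all prompt rows.

    Assumption: queries and prompt share one feature dimension; learned
    query/key/value projections are left out to keep the sketch short.
    """
    dim = len(prompt[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in prompt]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, prompt)) for d in range(dim)])
    return outputs

def timbre_condition(hidden, prompt, text_projection=None):
    """Route either prompt type through the same cross-attention.

    text_projection is a hypothetical (text_dim x shared_dim) matrix. Passing
    None means the prompt is treated as speech features already in the shared space.
    """
    if text_projection is not None:
        prompt = matmul(prompt, text_projection)
    return cross_attention(hidden, prompt)

# Toy inputs (made-up values): 2 decoder frames in a 2-dimensional shared space.
hidden = [[1.0, 0.0], [0.0, 1.0]]
speech_prompt = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]   # 3 speech-prompt frames
text_prompt = [[1.0, 0.0, 1.0]]                          # 1 text token, text_dim = 3
projection = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]        # maps text_dim 3 to shared dim 2

from_speech = timbre_condition(hidden, speech_prompt)
from_text = timbre_condition(hidden, text_prompt, projection)
```

Both calls return one conditioning vector per decoder frame, so the layers after the cross-attention do not need to know which prompt modality was supplied. In this sketch the only modality-specific piece is the projection applied to the text features.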

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
