UT Austin, Xiamen University | Jun 10, 2026 | arXiv:2606.11611

SARA: A Dual-Stream VAE for High-Fidelity Speech Generation via Integrating Semantic and Acoustic Representations

Peijie Chen, Wenhao Guan, Weijie Wu, Kaidi Wang, Daiyu Huang, Zhuanling Zha, Junbo Li, Jun Fang, Q. Hong, Qingyang Hong, Lin Li

AI Summary

This paper introduces SARA, a dual-stream variational autoencoder (VAE) that integrates semantic and acoustic representations for zero-shot text-to-speech (TTS). By fusing a frozen self-supervised learning (SSL) semantic anchor with a dedicated residual acoustic encoder, SARA mitigates the trade-off between high-fidelity audio and linguistic accuracy, yielding a compact latent space without relying on complex regularizers. The authors report reconstruction quality superior to strong baselines, along with natural and expressive speech in downstream zero-shot TTS tasks that remains robust under accelerated inference.

Key Contribution

SARA fuses a frozen SSL semantic anchor with a residual acoustic encoder in a single VAE, balancing high-fidelity audio reconstruction against precise linguistic alignment for zero-shot TTS.

Abstract

Zero-shot text-to-speech (TTS) relies on robust speech representations. However, current speech tokenizers face a fundamental trade-off: acoustic codecs preserve high-fidelity audio but lack linguistic constraints, causing content errors during generation, whereas semantic tokens from self-supervised learning (SSL) models ensure precise text alignment but discard some acoustic information. To bridge this gap, we propose SARA, a dual-stream VAE that directly fuses a frozen SSL semantic anchor with a dedicated residual acoustic encoder. This effectively mitigates the dilemma, creating an efficient and compact latent space without relying on complex regularizers. SARA achieves superior reconstruction quality over strong baselines. Furthermore, in downstream zero-shot TTS tasks, it yields highly natural and expressive synthesis quality, and maintains robust generation performance even under accelerated inference, offering a favorable trade-off between synthesis speed and computational cost.
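The dual-stream idea in the abstract can be illustrated with a toy, per-frame sketch in plain Python. Everything below is an assumption made for illustration and is not taken from the paper: the class name `DualStreamVAESketch`, the random linear maps standing in for the frozen SSL model and the acoustic encoder, fusion by concatenation, and the plain Gaussian KL term. It only shows the data flow of a fixed semantic stream and a separate acoustic stream feeding one VAE posterior.

```python
import math
import random


def init_linear(rng, in_dim, out_dim):
    """Create a small random weight matrix and a zero bias."""
    weight = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


class DualStreamVAESketch:
    """Hypothetical toy encoder: a fixed semantic stream plus an acoustic stream."""

    def __init__(self, frame_dim, sem_dim, ac_dim, latent_dim, seed=0):
        self.rng = random.Random(seed)
        # Stand-in for the frozen SSL semantic anchor: these weights are never updated.
        self.semantic = init_linear(self.rng, frame_dim, sem_dim)
        # Stand-in for the acoustic encoder: in a real model this would be trained.
        self.acoustic = init_linear(self.rng, frame_dim, ac_dim)
        # Posterior heads over the fused features (assumed fusion: concatenation).
        self.mu_head = init_linear(self.rng, sem_dim + ac_dim, latent_dim)
        self.logvar_head = init_linear(self.rng, sem_dim + ac_dim, latent_dim)

    def encode(self, frame):
        """Return the posterior mean and log-variance for one input frame."""
        semantic = [math.tanh(v) for v in linear(frame, *self.semantic)]
        acoustic = [math.tanh(v) for v in linear(frame, *self.acoustic)]
        fused = semantic + acoustic  # list concatenation of the two streams
        return linear(fused, *self.mu_head), linear(fused, *self.logvar_head)

    def sample(self, mu, logvar):
        """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1)."""
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]


model = DualStreamVAESketch(frame_dim=8, sem_dim=4, ac_dim=4, latent_dim=3)
frame = [0.1] * 8
mu, logvar = model.encode(frame)
z = model.sample(mu, logvar)
kl = kl_to_standard_normal(mu, logvar)
```

In this sketch the latent `z` has three dimensions per frame and `kl` is the only regularizer; how SARA actually fuses the streams, defines the residual, and decodes to audio is not specified in the abstract.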

Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
