Mar 31, 2026

arXiv:2603.29339

LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space

Detai Xin, Shujie Hu, Chengzuo Yang, Chen Yang, Chenpei Huang, Chen Huang, Guoqiao Yu, Guanglu Wan, Xunliang Cai

AI Summary

LongCat-AudioDiT is a non-autoregressive diffusion TTS model that operates directly in the waveform latent space, using only a Wav-VAE and a diffusion backbone and avoiding intermediate acoustic representations such as mel-spectrograms. The authors fix a training-inference mismatch and replace classifier-free guidance with adaptive projection guidance to improve generation quality. The model achieves SOTA zero-shot voice cloning on the Seed benchmark without complex multi-stage training pipelines or high-quality human-annotated datasets, and the analysis shows that better Wav-VAE reconstruction does not guarantee better TTS performance.

Key Contribution

Ditching mel-spectrograms unlocks SOTA text-to-speech with a surprisingly simple diffusion model operating directly on waveform latents.

Abstract

We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
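The abstract contrasts classifier-free guidance (CFG) with adaptive projection guidance (APG) but does not give the update rule. The following is a minimal NumPy sketch of the general projection-guidance idea, not the paper's implementation: the guidance difference is split into components parallel and orthogonal to the conditional prediction, and the parallel part is down-weighted. The function names `cfg` and `apg`, the parameters `eta` and `norm_threshold`, and the toy latent shapes are all assumptions made for illustration.

```python
import numpy as np


def cfg(cond, uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return uncond + scale * (cond - uncond)


def apg(cond, uncond, scale, eta=0.0, norm_threshold=None):
    """Projection-style guidance (illustrative sketch, hypothetical API).

    eta            weight on the component of the guidance difference that is
                   parallel to the conditional prediction (assumed parameter).
    norm_threshold optional cap on the norm of the guidance difference
                   (assumed parameter).
    """
    diff = cond - uncond
    if norm_threshold is not None:
        # Shrink the difference if its norm exceeds the threshold.
        norm = np.linalg.norm(diff)
        diff = diff * min(1.0, norm_threshold / (norm + 1e-12))
    # Unit vector along the conditional prediction.
    unit = cond / (np.linalg.norm(cond) + 1e-12)
    # Decompose the difference relative to that direction.
    parallel = np.dot(diff.ravel(), unit.ravel()) * unit
    orthogonal = diff - parallel
    return cond + (scale - 1.0) * (orthogonal + eta * parallel)


# Toy latents standing in for model outputs (shapes are arbitrary).
rng = np.random.default_rng(0)
cond = rng.standard_normal((4, 16))
uncond = rng.standard_normal((4, 16))

guided_cfg = cfg(cond, uncond, scale=3.0)
guided_apg = apg(cond, uncond, scale=3.0, eta=0.0)
```

In this sketch, setting `eta=1.0` with no norm cap reduces `apg` to `cfg`, so the two differ only in how strongly the parallel component is applied. Published projection-guidance formulations may also add a momentum term on the difference, which is omitted here.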

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 79

Year: 2026

Venue: N/A
