MOSS-TTS is a speech generation foundation model trained with discrete audio tokens, autoregressive modeling, and large-scale pretraining, leveraging MOSS-Audio-Tokenizer to compress 24 kHz audio into discrete tokens. Two models are released: MOSS-TTS, emphasizing scalability and long-context control, and MOSS-TTS-Local-Transformer, incorporating a frame-local autoregressive module for improved efficiency and speaker preservation. The models achieve zero-shot voice cloning, token-level duration control, and stable long-form generation across multilingual and open-domain settings.
Achieve controllable and scalable speech generation with MOSS-TTS, enabling zero-shot voice cloning and long-form synthesis.
This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Building on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 frames per second with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context, control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme- and pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
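For intuition on the token rates implied by the tokenizer description, the sketch below works out the arithmetic of compressing 24 kHz audio to 12.5 frames per second. It is a minimal illustration, not the released tokenizer API; the `samples_per_frame`/`tokens_for_clip` helpers and the `num_codebooks` parameter are assumptions standing in for the variable-bitrate RVQ setting.

```python
# Minimal sketch of the token-rate arithmetic stated in the report:
# 24 kHz audio tokenized at 12.5 frames per second means each frame
# covers 24000 / 12.5 = 1920 samples. Under RVQ, the number of
# codebooks per frame (an assumed parameter here) sets the final
# discrete-token count.

SAMPLE_RATE = 24_000   # Hz, as stated for MOSS-Audio-Tokenizer
FRAME_RATE = 12.5      # frames per second after compression

def samples_per_frame() -> int:
    # 24000 / 12.5 = 1920 audio samples summarized by one frame
    return int(SAMPLE_RATE / FRAME_RATE)

def tokens_for_clip(duration_s: float, num_codebooks: int = 8) -> int:
    """Approximate discrete-token count for a clip of `duration_s` seconds.

    `num_codebooks` is a hypothetical RVQ depth; the released tokenizer's
    actual (variable) bitrate settings may differ.
    """
    num_frames = int(duration_s * FRAME_RATE)
    return num_frames * num_codebooks

if __name__ == "__main__":
    print(samples_per_frame())    # 1920 samples per frame
    print(tokens_for_clip(10.0))  # 125 frames x 8 codebooks = 1000 tokens
```

Under these assumptions, a 10-second clip reduces to 125 frames, which is what makes long-form generation and long-context conditioning tractable for an autoregressive model over discrete tokens.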