Tsinghua AIFudanHFUTHunyuan TeamShanghai InnovationShanghai Qi Zhi InstituteSJTUTencent AIJun 1, 2026arXiv:2606.01802

MOSS-Audio Technical Report

Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu, Jingqi Chen, Ke Chen, Wenxuan Wang, Yang Wang, Yaozhou Jiang, Yi Jiang, Zhengyuan Lin, Ziqi Chen, Zhaoye Fei, Chenghao Liu, Jun Zhan, Kang Yu, Kexin Huang, Mingshu Chen, Qinyuan Cheng, Ruixiao Li, Shimin Li, Songlin Wang, Yang Gao, Yiyang Zhang, Xipeng Qiu

AI Summary

MOSS-Audio is a novel audio-language model designed to enhance understanding of speech, environmental sounds, and music through a unified framework that integrates a dedicated audio encoder, a modality adapter, and a large language model. Key innovations include DeepStack cross-layer feature injection for improved acoustic information access and the use of timestamp markers for temporal grounding, enabling advanced capabilities like audio captioning and time-aware question answering. The model demonstrates strong performance in various tasks, establishing itself as a robust foundation for future voice agents and audio understanding applications.

Key Contribution

MOSS-Audio achieves state-of-the-art performance in audio understanding tasks by effectively integrating temporal cues and deep acoustic features, setting a new benchmark for audio-language models.

Abstract

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: \textbf{DeepStack cross-layer feature injection}, which exposes the decoder to acoustic information from multiple encoder depths, and \textbf{time markers}, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.

Multimodal Models Speech & Audio

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

MOSS-Audio Technical Report

Related Papers