Tsinghua AICAS | Jul 7, 2025 | arXiv:2507.05177

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang

AI Summary

The authors introduce OpenS2S, a fully open-source end-to-end large speech language model (LSLM) for empathetic speech interactions. OpenS2S builds upon the BLSP-Emo model and uses a streaming interleaved decoding architecture for low-latency speech generation. To enable end-to-end training, they develop an automated data construction pipeline that leverages LLMs and controllable TTS systems to synthesize a diverse and scalable empathetic speech dialogue corpus.

Key Contribution

OpenS2S offers researchers a fully transparent, end-to-end LSLM for empathetic speech interaction, filling a critical gap left by increasingly closed-off state-of-the-art models.

Abstract

Empathetic interaction is a cornerstone of human-machine communication: it requires understanding speech enriched with paralinguistic cues and generating emotional, expressive responses. However, the most powerful empathetic large speech language models (LSLMs) are increasingly closed off, leaving crucial details about their architecture, data and development opaque to researchers. Given the critical need for transparent research into LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent, end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, and pre-training and fine-tuning code, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S
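
The data construction idea in the abstract (an LLM writes the empathetic reply, a controllable TTS system varies speaker and emotion) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the names `DialogueSample`, `build_corpus`, `fake_llm` and `fake_tts` are hypothetical, the two stub functions stand in for a real LLM and a real TTS system, and the loop is not the authors' actual pipeline.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass(frozen=True)
class DialogueSample:
    """One synthetic training example (hypothetical schema)."""
    speaker: str
    emotion: str
    query_text: str
    query_audio: bytes
    response_text: str


def build_corpus(
    queries: List[str],
    speakers: List[str],
    emotions: List[str],
    generate_response: Callable[[str, str], str],
    synthesize: Callable[[str, str, str], bytes],
) -> List[DialogueSample]:
    """Cross every query with every speaker and emotion.

    The TTS callable renders the user query in the given voice and emotion;
    the LLM callable writes a reply conditioned on that emotion.
    """
    samples = []
    for query, speaker, emotion in product(queries, speakers, emotions):
        audio = synthesize(query, speaker, emotion)
        response = generate_response(query, emotion)
        samples.append(DialogueSample(speaker, emotion, query, audio, response))
    return samples


# Stand-in for an LLM call (assumption: a real system would prompt a model).
def fake_llm(query: str, emotion: str) -> str:
    return f"[reply to a {emotion} user] {query}"


# Stand-in for a controllable TTS call (assumption: real output is a waveform).
def fake_tts(text: str, speaker: str, emotion: str) -> bytes:
    return f"{speaker}|{emotion}|{text}".encode("utf-8")


corpus = build_corpus(
    queries=["I failed my exam."],
    speakers=["spk_a", "spk_b"],
    emotions=["sad", "angry", "neutral"],
    generate_response=fake_llm,
    synthesize=fake_tts,
)
print(len(corpus))
```

Passing the LLM and TTS steps in as callables keeps the sketch independent of any particular model or toolkit; one text query fans out into several spoken variants, which is one plausible way to obtain paralinguistic diversity with little human supervision.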

Natural Language Processing | Open-Source Models & Weights | Speech & Audio

Citation Metrics

Citations: 4

Influential citations: 0

References: 62

Year: 2025

Venue: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
