Moonstep AI | Northwestern | Soul AI Lab | ZJU | Jun 1, 2026 | arXiv:2606.02400

SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription

Yuhang Dai, Haopeng Lin, Zhennan Lin, Jiale Qian, Jun Wu, Hanke Xie, Hao Meng, Hanlin Wen, Chuang Ding, Shunshun Yin, Ming Tao, Lei Xie, Xinsheng Wang

AI Summary

SoulX-Transcriber is a multi-speaker transcription system that jointly models speaker diarization and automatic speech recognition within a unified LLM-based framework. It is trained in two stages: speaker-aware multi-task pre-training first strengthens speaker representation and boundary perception, and supervised fine-tuning then optimizes the model for speaker-attributed transcription in complex multi-speaker audio. The report describes strong, robust performance on the AliMeeting, AISHELL-4, and AMI benchmarks and adaptability to multi-domain conversational scenarios.

Key Contribution

SoulX-Transcriber targets accurate multi-speaker transcription by addressing similar-sounding speakers, overlapping speech, rapid turn-taking, and imprecise speaker boundaries within a single end-to-end model.

Abstract

Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-taking, overlapping utterances, and inaccurate speaker boundary segmentation. These challenges become particularly pronounced in real-world conversational audio, where speaker dynamics and acoustic conditions are highly variable. This technical report presents SoulX-Transcriber, a unified multi-speaker transcription system that jointly models speaker diarization (SD) and ASR within an LLM-based framework. SoulX-Transcriber adopts a two-stage training strategy to improve both speaker discrimination and transcription robustness. In the first stage, speaker-aware multi-task continuous pre-training enhances speaker representation learning and boundary perception. In the second stage, supervised fine-tuning further optimizes the model for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions. SoulX-Transcriber delivers strong performance and robustness across multiple public benchmarks, including AliMeeting, AISHELL-4, and AMI, while maintaining high adaptability to multi-domain scenarios.
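
To make "speaker-attributed transcription" concrete, the sketch below shows one way such output could be consumed downstream. The serialization (`<spkN> [start-end] text`, one segment per line) and the names `Segment`, `parse_transcript`, and `overlapping_pairs` are assumptions made for illustration only; the abstract does not specify SoulX-Transcriber's actual output format.

```python
import re
from dataclasses import dataclass

# Hypothetical serialization: one segment per line, "<spkN> [start-end] text".
# This format is an assumption for illustration, not taken from the report.
SEGMENT_RE = re.compile(
    r"<spk(\d+)>\s*\[(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\]\s*(.*)"
)


@dataclass
class Segment:
    speaker: int   # speaker index as emitted by the model
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    text: str      # transcribed words for this segment


def parse_transcript(serialized: str) -> list[Segment]:
    """Parse the assumed line-based format into a list of segments."""
    segments = []
    for line in serialized.strip().splitlines():
        match = SEGMENT_RE.fullmatch(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        spk, start, end, text = match.groups()
        segments.append(Segment(int(spk), float(start), float(end), text))
    return segments


def overlapping_pairs(segments: list[Segment]) -> list[tuple[int, int]]:
    """Return index pairs of segments from different speakers that overlap in time."""
    pairs = []
    for i, a in enumerate(segments):
        for j in range(i + 1, len(segments)):
            b = segments[j]
            if a.speaker != b.speaker and a.start < b.end and b.start < a.end:
                pairs.append((i, j))
    return pairs
```

Given segments in this form, overlapping utterances and rapid turn-taking, two of the difficulties named in the abstract, show up directly as time-overlapping or closely adjacent segments with different speaker labels.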

Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
