SoulX-Transcriber is a multi-speaker transcription system that integrates speaker diarization and automatic speech recognition within a unified LLM-based framework. It is trained in two stages: the first strengthens speaker representation and boundary perception, and the second applies supervised fine-tuning for accurate transcription in complex audio environments. The system is reported to be robust and to perform strongly on public benchmarks including AliMeeting, AISHELL-4, and AMI, while adapting to diverse conversational scenarios.
SoulX-Transcriber targets high-accuracy multi-speaker transcription by explicitly addressing speaker overlap and rapid turn-taking, and reports strong results on public benchmarks.
Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-taking transitions, overlapping utterances, and inaccurate speaker boundary segmentation. These challenges become particularly pronounced in real-world conversational audio, where speaker dynamics and acoustic conditions are highly variable. This technical report presents SoulX-Transcriber, a unified multi-speaker transcription system that jointly models speaker diarization (SD) and ASR within an LLM-based framework. SoulX-Transcriber adopts a two-stage training strategy to improve both speaker discrimination and transcription robustness. In the first stage, speaker-aware multi-task continuous pre-training enhances speaker representation learning and boundary perception. In the second stage, supervised fine-tuning further optimizes the model for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions. SoulX-Transcriber delivers strong performance and robustness across multiple public benchmarks, including AliMeeting, AISHELL-4, and AMI, while maintaining high adaptability to multi-domain scenarios.
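To make "speaker-attributed transcription" concrete, the sketch below shows one way a single serialized output string could be split back into per-speaker turns. The `<spkN>` tag format, the `parse_speaker_attributed` helper, and the example sentence are all assumptions made for illustration; the abstract does not describe the output format SoulX-Transcriber actually uses.

```python
import re
from typing import List, Tuple

# Hypothetical speaker tag format ("<spk1>", "<spk2>", ...).
# This is an illustrative assumption, not the format used by SoulX-Transcriber.
SPEAKER_TAG = re.compile(r"<spk(\d+)>")


def parse_speaker_attributed(text: str) -> List[Tuple[str, str]]:
    """Split a serialized transcript into (speaker, utterance) turns.

    Each turn runs from the end of one speaker tag to the start of the next
    tag (or to the end of the string for the final turn).
    """
    turns: List[Tuple[str, str]] = []
    matches = list(SPEAKER_TAG.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        utterance = text[match.end():end].strip()
        if utterance:  # skip tags that carry no words
            turns.append((f"spk{match.group(1)}", utterance))
    return turns


# Made-up example string, not real model output.
example = "<spk1> shall we start <spk2> yes go ahead <spk1> first item is the budget"
turns = parse_speaker_attributed(example)
for speaker, utterance in turns:
    print(f"{speaker}: {utterance}")
```

A joint SD and ASR system emits who spoke and what was said in one pass, so a flat tagged string like this is one plausible interface; a pipeline with separate diarization and recognition stages would instead have to align two independent outputs.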