Mar 2, 2026 | arXiv:2603.02030

TCG CREST System Description for the DISPLACE-M Challenge

Nikhil Raghav, Md. Sahidullah

AI Summary

This paper describes the TCG CREST system developed for the DISPLACE-M challenge Track 1 (speaker diarization), focusing on noisy, real-world medical conversations. The authors compared a SpeechBrain-based modular pipeline with a hybrid end-to-end neural diarization system (Diarizen) built on WavLM, exploring various clustering techniques like AHC and spectral clustering variants. Results showed that the Diarizen system significantly outperformed the SpeechBrain baseline, achieving a 9.21% DER on the evaluation set using AHC and ranking sixth in the challenge.

Key Contribution

A WavLM-based Diarizen system reduces the diarization error rate by approximately 39% relative to a SpeechBrain pipeline on noisy rural-healthcare conversations.

Abstract

This report presents the TCG CREST system description for Track 1 (Speaker Diarization) of the DISPLACE-M challenge, focusing on naturalistic medical conversations in noisy rural-healthcare scenarios. Our study evaluates the impact of various voice activity detection (VAD) methods and advanced clustering algorithms on overall speaker diarization (SD) performance. We compare and analyze two SD frameworks: a modular pipeline utilizing SpeechBrain with ECAPA-TDNN embeddings, and a state-of-the-art (SOTA) hybrid end-to-end neural diarization system, Diarizen, built on top of a pre-trained WavLM. With these frameworks, we explore diverse clustering techniques, including agglomerative hierarchical clustering (AHC) and multiple novel variants of spectral clustering, such as SC-adapt, SC-PNA, and SC-MK. Experimental results demonstrate that the Diarizen system provides an approximately 39% relative improvement in diarization error rate (DER) over the SpeechBrain baseline in the post-evaluation analysis of Phase I. Our best-performing submitted system, the Diarizen baseline with AHC and median filtering over a larger context window of 29, achieved a DER of 10.37% on the development set and 9.21% on the evaluation set. Our team ranked sixth out of the 11 participating teams after the Phase I evaluation.
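The median filtering mentioned in the abstract can be illustrated with a minimal sketch. It assumes the filter is applied to binary frame-level speaker-activity decisions and that the window of 29 is counted in frames; the abstract states neither, and the function name, edge-padding scheme, and toy data below are hypothetical, not taken from the authors' system.

```python
def median_filter(frames, window=29):
    """Smooth a sequence of 0/1 frame decisions with a sliding median.

    Edges are padded by repeating the first and last values. This padding
    is an assumption; the abstract does not describe edge handling.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    if not frames:
        return []
    half = window // 2
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    smoothed = []
    for i in range(len(frames)):
        # The median of an odd-length window is its middle element once sorted.
        neighborhood = sorted(padded[i:i + window])
        smoothed.append(neighborhood[half])
    return smoothed


# Hypothetical toy input: a 3-frame spurious burst inside silence, then a
# long speech run interrupted by a 2-frame dropout.
activity = [0] * 20 + [1] * 3 + [0] * 20 + [1] * 30 + [0] * 2 + [1] * 30
smoothed = median_filter(activity, window=29)
```

With a window this wide, the short burst is suppressed and the brief dropout is filled, while the long speech run is preserved. The trade-off is that genuinely short speaker turns are smoothed away as well.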

Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 27

Year: 2026

Venue: N/A
