Google Research, IISc | Mar 3, 2026 | arXiv:2603.02813

Benchmarking Speech Systems for Frontline Health Conversations: The DISPLACE-M Challenge

Dhanya E, Ankita Meena, Manas Nanivadekar, Noumida A, Victor Azad, Ashwini Nagaraj Shenoy, Pratik Roy Chowdhuri, Shobhit Banga, Vanshika Chhabra, Vansh Chhabra, Chitralekha Bhat, Shareef Babu Kalluri, Srikanth Raj Chetupalli, Deepu Vijayasenan, Sriram Ganapathy

AI Summary

The DISPLACE-M challenge introduces a new benchmark dataset and evaluation framework for conversational AI systems designed to understand real-world medical dialogues in Indian languages. The dataset comprises 25 hours of development data and 10 hours of evaluation data featuring multi-speaker interactions with spontaneous, noisy, and overlapping speech. Baseline systems were provided for speaker diarization, ASR, topic identification, and dialogue summarization, but the challenge results from 12 participating teams indicate that current systems are not yet ready for healthcare deployment.

Key Contribution

Despite dedicated efforts from multiple teams, existing speech systems still fall significantly short of deployment readiness for understanding real-world medical conversations in Indian languages, highlighting the need for further research.

Abstract

The DIarization and Speech Processing for LAnguage understanding in Conversational Environments - Medical (DISPLACE-M) challenge introduces a conversational AI benchmark focused on understanding goal-oriented, real-world medical dialogues collected in the field. The challenge addresses multi-speaker interactions between healthcare workers and care seekers, characterized by spontaneous, noisy, and overlapping speech across Indian languages and dialects. As part of the challenge, a medical conversational dataset comprising 25 hours of development data and 10 hours of blind evaluation recordings was released. We provided baseline systems within a unified end-to-end pipeline across four tasks (speaker diarization, automatic speech recognition, topic identification, and dialogue summarization) to enable consistent benchmarking. System performance is evaluated using established metrics such as diarization error rate (DER), time-constrained minimum-permutation word error rate (tcpWER), and ROUGE-L. During this evaluation (Phase-I), 12 teams from across the globe actively participated, working to improve on the baseline systems on these metrics. However, even after a dedicated 6-8 week effort from the participants, the task proved substantially challenging, and existing systems remain significantly short of healthcare deployment readiness.
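Of the metrics named in the abstract, ROUGE-L is the simplest to illustrate: it scores a candidate summary by the longest common subsequence (LCS) of tokens it shares with a reference. The sketch below is a minimal illustration, not the challenge's scoring script. It assumes lowercased whitespace tokenization and the balanced F1 variant, and the function names `lcs_length` and `rouge_l_f1` are hypothetical; the official evaluation may tokenize or weight precision and recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between a reference and a candidate summary.

    Assumption: lowercased whitespace tokens and beta = 1 (balanced F-measure).
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up sentences for illustration only; not drawn from the dataset.
    print(rouge_l_f1("the patient has a fever", "the patient has fever"))
```

Because the LCS only requires tokens to appear in the same order, not contiguously, the metric tolerates insertions and deletions in a summary while still penalizing reordered or missing content.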

Eval Frameworks & Benchmarks | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 29

Year: 2026

Venue: N/A
