ETH | UZH | May 4, 2026 | arXiv:2605.02782

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

Pehuén Moure, Niclas Pokel, Bilal Bounajma, Yingqiang Gao, Roman Boehringer, Longbiao Cheng, Shih-Chii Liu

AI Summary

This paper introduces a benchmark, built on the Speech Accessibility Project (SAP) dataset, to evaluate whether audio-language models can leverage clinical context (diagnosis, speech ratings, clinical descriptions) to improve dysarthric speech recognition. Through prompting experiments across nine models, the authors find that current models fail to meaningfully utilize this context, often degrading word error rate (WER). However, context-dependent LoRA fine-tuning with mixed clinical prompts achieves a 52% relative WER reduction (to 0.066) while maintaining performance without context, demonstrating the potential for improvement.

Key Contribution

Despite the promise of multimodal context, current audio-language models struggle to leverage clinical information for dysarthric speech recognition, even degrading performance in some cases.

Abstract

Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech. Across matched comparisons on nine models, we find that current models do not meaningfully use this context: diagnosis-informed and clinically detailed prompts yield negligible improvements and often degrade word error rate. We complement the prompting analysis with context-dependent fine-tuning, showing that LoRA adaptation with a mixture of clinical prompt formats achieves a WER of 0.066, a 52% relative reduction over the frozen baseline, while preserving performance when context is unavailable. Subgroup analyses reveal significant gains for Down syndrome and mild-severity speakers. These results clarify where current models fall short and provide a testbed for measuring progress toward more inclusive ASR.
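The abstract's headline numbers rest on two quantities: word error rate and relative reduction. The sketch below shows how each is conventionally computed. It is not the authors' evaluation code; the function names `wer` and `relative_reduction` are illustrative, and text normalization (casing, punctuation) is assumed to have been done beforehand. The frozen-baseline WER is not quoted in the abstract, so the value of about 0.1375 used here is only what the reported 0.066 and 52% imply, and it is approximate because 52% is a rounded figure.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Baseline implied by the abstract's figures (derived here, not reported).
implied_baseline = 0.066 / (1 - 0.52)

if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    print(implied_baseline, relative_reduction(implied_baseline, 0.066))
```

Corpus-level WER is usually obtained by summing edit counts and reference lengths over all utterances rather than averaging per-utterance scores; which convention the paper uses is not stated in the abstract.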

Eval Frameworks & Benchmarks | Multimodal Models | Speech & Audio


Year: 2026

Venue: N/A
