National Institute of Technology Sikkim | USC | Jun 18, 2026 | arXiv:2606.19797

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

Paban Sapkota, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan

AI Summary

This study enhances automatic speech recognition (ASR) for dysarthric speech by employing data augmentation techniques tailored to varying severity levels. By fine-tuning the Wav2Vec2 model with methods such as Speaking-Rate Modification and Pitch Modification, the researchers address the challenges posed by limited data availability. The results reveal significant improvements in word error rates (WER), with the best performance achieved through specific augmentations for each severity class, highlighting the potential of targeted data strategies in this domain.

Key Contribution

Severity-specific data augmentation reduces word error rate in dysarthric speech recognition by roughly 15% to 30% relative to the baseline, depending on severity (30.02% for low, 16.64% for medium, and 15.47% for high).

Abstract

Dysarthric speech recognition is crucial for facilitating effective communication among individuals with dysarthria. However, accurately recognizing dysarthric speech poses significant challenges due to varying severity levels and limited data availability. In this paper, we explore data augmentation techniques for dysarthric automatic speech recognition (ASR) systems by fine-tuning the end-to-end pre-trained Wav2Vec2 model, with a specific focus on severity levels. To address the challenges of data scarcity and the need for extensive data in fine-tuning pre-trained ASR systems for dysarthric speech, we investigate four prominent data augmentation methods: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP), tailored to different aspects of dysarthria. The study uses individually fine-tuned Wav2Vec2 models for each severity class as baseline systems. Additionally, we conducted severity-specific fine-tuning of the ASR model using augmented data. Results demonstrate distinct efficacy patterns for each augmentation technique across severity levels. The best WERs were achieved with SRM ($s = 0.8$) for low (9.02%) and medium (38.11%) severities, and with PM ($\tau = 0.8$) for high severity (55.15%), reflecting relative improvements of 30.02%, 16.64%, and 15.47%, respectively. These results confirm the effectiveness of the augmentation methods in improving dysarthric ASR performance.
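
The abstract reports results as word error rate (WER) and relative improvement over a baseline. As a minimal sketch of how these two quantities are conventionally computed (not the authors' evaluation code), the Python below implements word-level edit distance and the relative-improvement formula. The function names are hypothetical, and the baseline WER shown in the usage lines is back-computed from the reported figures; it is an assumption, not a value stated in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, rel_improvement_pct: float) -> float:
    """Invert the formula above to recover the baseline WER it implies."""
    return new_wer / (1.0 - rel_improvement_pct / 100.0)


# Low-severity figures from the abstract: WER 9.02%, relative improvement 30.02%.
# The baseline is derived here by assumption; the abstract does not state it.
low_baseline = implied_baseline(9.02, 30.02)
print(f"implied low-severity baseline WER: {low_baseline:.2f}%")
print(f"relative improvement: {relative_improvement(low_baseline, 9.02):.2f}%")
```

Under this formula, the same absolute WER drop counts for more when the baseline is already low, which is one reason the low-severity class can show the largest relative gain while having the smallest WER.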

Data Curation & Synthetic Data | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
