The paper introduces DARS, a dysarthria-aware rhythm-style synthesis framework built on Matcha-TTS, which generates synthetic dysarthric speech for ASR data augmentation. DARS combines a multi-stage rhythm predictor, trained with contrastive preferences between normal and dysarthric speech, with a dysarthric-style conditional flow matching mechanism to model pathological rhythm and acoustic style. In experiments on the TORGO dataset, adapting a Whisper-based ASR system with DARS-generated speech yields a 54.22% relative reduction in WER compared to state-of-the-art methods.
Synthesizing realistic dysarthric speech cuts ASR word error rate by more than 50% in relative terms, thanks to a novel rhythm-style modeling approach.
Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pathological rhythm and acoustic style of dysarthric speech. To address this, we propose DARS, a dysarthria-aware rhythm-style synthesis framework based on the Matcha-TTS architecture. DARS incorporates a multi-stage rhythm predictor optimized by contrastive preferences between normal and dysarthric speech, along with a dysarthric-style conditional flow matching mechanism, jointly enhancing temporal rhythm reconstruction and pathological acoustic style simulation. Experiments on the TORGO dataset demonstrate that DARS achieves a Mean Cepstral Distortion (MCD) of 4.29, closely approximating real dysarthric speech. Adapting a Whisper-based ASR system with synthetic dysarthric speech from DARS achieves a 54.22% relative reduction in word error rate (WER) compared to state-of-the-art methods, demonstrating the framework's effectiveness in enhancing recognition performance.
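The two headline metrics can be made concrete with a short sketch. The code below is not taken from the paper. It assumes the conventional definition of a relative WER reduction, (baseline - adapted) / baseline, and the commonly used per-frame cepstral distortion formula (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over time-aligned frames. The function names, the frame alignment, and the exclusion of the 0th coefficient are assumptions made for illustration, and the abstract does not state which exact MCD variant the authors used.

```python
import math


def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction in percent: 100 * (baseline - adapted) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


def cepstral_distortion(ref_frames, syn_frames):
    """Average cepstral distortion over frames that are already time-aligned.

    Each frame is a sequence of cepstral coefficients. This sketch assumes the
    0th (energy) coefficient has been dropped beforehand and that both inputs
    have the same number of frames (for example after DTW alignment).
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("inputs must be non-empty and have equal frame counts")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)


# Hypothetical numbers, not results from the paper:
# a baseline WER of 20.0 and an adapted WER of 10.0 give a 50.0% relative reduction.
example_reduction = relative_wer_reduction(20.0, 10.0)
```

Note that a relative reduction is measured against the baseline error rate, so a 54.22% relative reduction does not mean the absolute WER dropped by 54.22 percentage points.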