Mar 29, 2026 | arXiv:2603.27508

Investigation on the Robustness of Acoustic Foundation Models on Post Exercise Speech

Xiangyuan Xue, Yuyu Wang, Ruijie Yao, Xiaoyue Ni, Xiaofan Jiang, Jingping Nie

AI Summary

This paper benchmarks the robustness of five acoustic foundation models (Whisper, FunASR, Wav2Vec2, HuBERT, and WavLM) on post-exercise speech, which contains artifacts such as micro-breaths and unstable phonation. The authors find that FunASR has the strongest baseline robustness, while fine-tuning improves several CTC-based models but leads to unstable adaptation in Whisper. The study also argues that future work should separate fluency-related effects from exercise-induced speech variation.

Key Contribution

Your speech recognition model might stumble after a workout: most models degrade on post-exercise speech, and which model performs best depends on whether it has been fine-tuned.

Abstract

Automatic speech recognition (ASR) has been extensively studied on neutral and stationary speech, yet its robustness under post-exercise physiological shift remains underexplored. Compared with resting speech, post-exercise speech often contains micro-breaths, non-semantic pauses, unstable phonation, and repetitions caused by reduced breath support, making transcription more difficult. In this work, we benchmark acoustic foundation models on post-exercise speech under a unified evaluation protocol. We compare sequence-to-sequence models (Whisper and FunASR/Paraformer) and self-supervised encoders with CTC decoding (Wav2Vec2, HuBERT, and WavLM), under both off-the-shelf inference and post-exercise in-domain fine-tuning. Across the Static/Post-All benchmark, most models degrade on post-exercise speech, while FunASR shows the strongest baseline robustness at 14.57% WER and 8.21% CER on Post-All. Fine-tuning substantially improves several CTC-based models, whereas Whisper shows unstable adaptation. As an exploratory case study, we further stratify results by fluent and non-fluent speakers; although the non-fluent subset is small, it is consistently more challenging than the fluent subset. Overall, our findings show that post-exercise ASR robustness is strongly model-dependent, that in-domain adaptation can be highly effective but not uniformly stable, and that future post-exercise ASR studies should explicitly separate fluency-related effects from exercise-induced speech variation.
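The abstract reports results as word error rate (WER) and character error rate (CER). As a rough illustration of how these metrics are commonly computed, here is a minimal sketch, assuming corpus-level pooling of errors and simple whitespace tokenization. The function names (`edit_distance`, `wer`, `cer`) are hypothetical, and the paper's actual text normalization and scoring tools are not stated here, so this should not be read as the authors' evaluation code.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level WER: total word errors / total reference words."""
    total = sum(len(r.split()) for r in refs)
    if total == 0:
        raise ValueError("references contain no words")
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / total


def cer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level CER over raw characters (assumption: spaces are counted)."""
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("references contain no characters")
    errors = sum(edit_distance(list(r), list(h)) for r, h in zip(refs, hyps))
    return errors / total


if __name__ == "__main__":
    # Made-up transcripts for illustration only, not data from the paper.
    refs = ["i just finished running", "give me a moment"]
    hyps = ["i just just finished running", "give me moment"]
    print(f"WER: {wer(refs, hyps):.2%}  CER: {cer(refs, hyps):.2%}")
```

Pooling errors across utterances, rather than averaging per-utterance rates, keeps short utterances from dominating the score. The same functions could be applied separately to subsets (for example, fluent and non-fluent speakers) to produce stratified results.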

Eval Frameworks & Benchmarks, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
