This paper benchmarks the robustness of several acoustic foundation models (Whisper, FunASR, Wav2Vec2, HuBERT, WavLM) on post-exercise speech, which contains artifacts such as micro-breaths and unstable phonation. The authors find that FunASR shows the strongest baseline robustness, while fine-tuning improves several CTC-based models but leads to unstable adaptation in Whisper. The study also highlights the importance of separating fluency-related effects from exercise-induced speech variation in future research.
Turns out your fancy speech recognition model might stumble after a workout: performance degrades significantly on post-exercise speech, and the best model varies depending on whether you fine-tune it.
Automatic speech recognition (ASR) has been extensively studied on neutral and stationary speech, yet its robustness under post-exercise physiological shift remains underexplored. Compared with resting speech, post-exercise speech often contains micro-breaths, non-semantic pauses, unstable phonation, and repetitions caused by reduced breath support, making transcription more difficult. In this work, we benchmark acoustic foundation models on post-exercise speech under a unified evaluation protocol. We compare sequence-to-sequence models (Whisper and FunASR/Paraformer) and self-supervised encoders with CTC decoding (Wav2Vec2, HuBERT, and WavLM), under both off-the-shelf inference and post-exercise in-domain fine-tuning. Across the Static/Post-All benchmark, most models degrade on post-exercise speech, while FunASR shows the strongest baseline robustness at 14.57% WER and 8.21% CER on Post-All. Fine-tuning substantially improves several CTC-based models, whereas Whisper shows unstable adaptation. As an exploratory case study, we further stratify results by fluent and non-fluent speakers; although the non-fluent subset is small, it is consistently more challenging than the fluent subset. Overall, our findings show that post-exercise ASR robustness is strongly model-dependent, that in-domain adaptation can be highly effective but not uniformly stable, and that future post-exercise ASR studies should explicitly separate fluency-related effects from exercise-induced speech variation.
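For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how WER and CER are commonly computed from reference/hypothesis transcript pairs. It is not the authors' evaluation code: the function names are hypothetical, and the choices to pool errors over the whole corpus, split words on whitespace, and drop spaces before counting characters are assumptions. The text normalization used in the paper is not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(pairs, tokenize):
    """Corpus-level rate: total edits divided by total reference tokens."""
    errors = 0
    total = 0
    for ref, hyp in pairs:
        ref_tokens = tokenize(ref)
        errors += edit_distance(ref_tokens, tokenize(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("reference transcripts contain no tokens")
    return errors / total


def wer(pairs):
    # Assumption: words are whitespace-separated tokens.
    return error_rate(pairs, str.split)


def cer(pairs):
    # Assumption: spaces are removed before counting characters.
    return error_rate(pairs, lambda s: list(s.replace(" ", "")))


if __name__ == "__main__":
    # Illustrative transcripts only, not data from the paper.
    pairs = [("i need a short break", "i i need short break")]
    print(f"WER = {wer(pairs):.2%}, CER = {cer(pairs):.2%}")
```

The same functions can be applied separately to each subset (for example Static versus Post-All, or fluent versus non-fluent speakers) by passing only that subset's transcript pairs. A repeated word in the hypothesis, as in the illustrative pair, counts as an insertion.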