This paper investigates the impact of using Meta's SAM-Audio speech enhancement model as a preprocessing step for zero-shot ASR with Whisper across Bengali and English datasets. The authors find that SAM-Audio consistently degrades ASR performance (higher WER and CER) despite improving signal-level quality (PSNR). Utterance-level analysis reveals that the degradation is systematic and worsens with larger Whisper models, indicating a mismatch between human-perceptible audio quality and machine recognition robustness.
Perceptually "cleaner" audio, achieved through state-of-the-art denoising, can actually *harm* zero-shot ASR performance, challenging the assumption that better audio quality always translates to better recognition.
Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. We present a systematic empirical study on the impact of Segment Anything Model Audio (SAM-Audio), a recent foundation-scale speech enhancement model proposed by Meta AI, when used as a preprocessing step for zero-shot transcription with Whisper. Experiments are conducted across multiple Whisper model variants and two linguistically distinct noisy speech datasets: a real-world Bengali YouTube corpus and a publicly available English noisy dataset. Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance, increasing both Word Error Rate (WER) and Character Error Rate (CER) compared to raw noisy speech, despite substantial improvements in signal-level quality. Objective Peak Signal-to-Noise Ratio (PSNR) analysis on the English dataset confirms that SAM-Audio produces acoustically cleaner signals, yet this improvement fails to translate into recognition gains. We therefore conducted a detailed utterance-level analysis to understand this counterintuitive result. We found that the recognition degradation is systematic, affecting the majority of utterances rather than isolated outliers, and that the errors worsen as the Whisper model size increases. These findings expose a fundamental mismatch: audio that is perceptually cleaner to human listeners is not necessarily more robust for machine recognition. This highlights the risk of blindly applying state-of-the-art denoising as a preprocessing step in zero-shot ASR pipelines.
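The three metrics named in the abstract can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' evaluation code: the function names are hypothetical, no text normalization is applied, and the PSNR helper assumes that a time-aligned clean reference signal with a known peak amplitude is available. The last helper shows one simple way an utterance-level comparison could be summarized, namely the share of utterances whose WER is higher after enhancement.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def psnr(clean: Sequence[float], test: Sequence[float], peak: float = 1.0) -> float:
    """PSNR in dB of `test` against a clean reference (assumed time-aligned)."""
    if len(clean) != len(test) or not clean:
        raise ValueError("signals must be non-empty and of equal length")
    mse = sum((c - t) ** 2 for c, t in zip(clean, test)) / len(clean)
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)


def fraction_degraded(wer_raw: Sequence[float], wer_enhanced: Sequence[float]) -> float:
    """Share of utterances whose WER is strictly higher after enhancement."""
    if len(wer_raw) != len(wer_enhanced) or not wer_raw:
        raise ValueError("inputs must be non-empty and of equal length")
    worse = sum(1 for r, e in zip(wer_raw, wer_enhanced) if e > r)
    return worse / len(wer_raw)
```

A higher PSNR alongside a higher WER or CER on the same utterance is the pattern the abstract describes: the enhanced signal is closer to the clean reference, yet the transcript is further from the ground truth.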