This paper introduces the Clean-Referenced Feature-Vocoder Attack, a surrogate-based black-box adversarial attack on automatic speech recognition (ASR) systems that operates in the self-supervised learning (SSL) feature space rather than directly on raw audio waveforms. By perturbing more generalizable acoustic-phonetic representations and reconstructing them into speech-like audio through a vocoder, the method improves transferability across ASR models and sidesteps defenses tailored to waveform perturbations. Optimized only on Whisper-small as a public surrogate, the attack reports a +26.6 word error rate (WER) improvement over the state-of-the-art baseline on black-box ASR models and a +36.2 WER improvement against multiple training defenses, pointing to a blind spot in current ASR robustness evaluation.
Adversarial attacks on ASR systems can achieve a +26.6 WER improvement over the SOTA baseline on black-box models by targeting SSL feature representations instead of raw audio, exposing a significant blind spot in current robustness evaluations.
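For readers unfamiliar with the metric behind the "+26.6 WER" figure, here is a minimal sketch of the standard word error rate: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. For an attacker, a higher WER on the target model means a stronger attack. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; the page does not say how the paper normalizes text or whether the reported gain is in percentage points.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.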
Automatic speech recognition (ASR) systems have become widely used for multilingual speech-to-text transcription. Their robustness to adversarial attacks has become an important topic for the community. Existing adversarial attacks directly add adversarial noise to the speech audio. However, prior work has shown that existing adversarial attacks face two limitations: they often transfer poorly to black-box ASR systems and are increasingly mitigated by defenses tailored to input-space perturbations. In this work, we propose a Clean-Referenced Feature-Vocoder Attack, a surrogate-based black-box attack that moves the adversarial search space from raw waveforms to self-supervised learning (SSL) representations. To address the transferability limitation, we perturb more generalizable acoustic-phonetic representations rather than low-level waveform samples, reducing dependence on surrogate-specific waveform gradients and encouraging adversarial perturbations that generalize across ASR systems. To bypass different defenses, we shift the adversarial signal from explicit additive waveform noise to SSL feature-space perturbations and reconstruct them through a vocoder into speech-like waveform adversarial signals, making the resulting samples less aligned with waveform-bounded defenses. Extensive experiments show that, when optimized only on raw Whisper-small as a public surrogate model, our attack transfers effectively to black-box ASR models with a +26.6 WER improvement over the SOTA baseline, while also remaining effective against multiple training defenses with a +36.2 WER improvement. These results reveal a blind spot in current ASR robustness evaluation.
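The pipeline described in the abstract (perturb SSL features, vocode them back to a waveform, optimize against a surrogate) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: random linear maps stand in for the vocoder and the surrogate ASR model, a single-label cross-entropy stands in for a transcription loss, and the L-infinity bound around the clean features is a hypothetical constraint. The abstract does not specify the paper's actual loss, constraint, or models, so this is not the authors' implementation.

```python
import numpy as np


def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def surrogate_loss_and_grad(feats, vocoder, surrogate, label):
    """Cross-entropy of the toy surrogate on vocoded features, and its gradient w.r.t. feats."""
    wave = vocoder @ feats                     # toy "vocoder": features -> waveform
    log_probs = log_softmax(surrogate @ wave)  # toy "ASR": waveform -> class log-probabilities
    loss = -log_probs[label]
    grad_logits = np.exp(log_probs)            # d(loss)/d(logits) = softmax - one_hot
    grad_logits[label] -= 1.0
    grad_feats = vocoder.T @ (surrogate.T @ grad_logits)  # chain rule through both linear maps
    return loss, grad_feats


def feature_space_attack(clean_feats, vocoder, surrogate, label,
                         eps=0.5, step=0.05, iters=50):
    """Signed gradient ascent on the surrogate loss, searching in feature space.

    The perturbation is clipped to an L-infinity ball of radius eps around the
    clean features (an assumed constraint, chosen only for this sketch).
    """
    delta = np.zeros_like(clean_feats)
    for _ in range(iters):
        _, grad = surrogate_loss_and_grad(clean_feats + delta, vocoder, surrogate, label)
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    adv_wave = vocoder @ (clean_feats + delta)  # the waveform is synthesized, not noise-added
    return adv_wave, delta


rng = np.random.default_rng(0)
vocoder = rng.normal(size=(64, 16)) / np.sqrt(16)   # 16-dim features -> 64-sample "waveform"
surrogate = rng.normal(size=(5, 64)) / np.sqrt(64)  # "waveform" -> 5 toy classes
clean_feats = rng.normal(size=16)
label = int(np.argmax(surrogate @ (vocoder @ clean_feats)))  # surrogate's clean prediction

clean_loss, _ = surrogate_loss_and_grad(clean_feats, vocoder, surrogate, label)
adv_wave, delta = feature_space_attack(clean_feats, vocoder, surrogate, label)
adv_loss, _ = surrogate_loss_and_grad(clean_feats + delta, vocoder, surrogate, label)
print(f"surrogate loss: clean={clean_loss:.3f}, adversarial={adv_loss:.3f}")
```

The point of the sketch is the location of the search variable: `delta` lives in feature space and the adversarial waveform is produced by the vocoder, so the difference between clean and adversarial audio is not a small additive waveform perturbation of the kind that waveform-bounded defenses assume.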