This paper investigates the trade-off between capacity and adversarial robustness in neural audio codecs used as a defense mechanism for automatic speech recognition (ASR). By varying the depth of residual vector quantization (RVQ) in the codec, the authors demonstrate a non-monotonic relationship between quantization granularity, speech content preservation, and robustness against gradient-based adversarial attacks. The key finding is that intermediate RVQ depths offer the best balance, minimizing transcription error by suppressing adversarial perturbations while maintaining speech content, and that adversarial token changes correlate with transcription errors.
A Goldilocks zone exists for neural audio codec quantization depth, where intermediate levels strike the best balance between suppressing adversarial noise and preserving speech content for robust ASR.
Adversarial perturbations exploit vulnerabilities in automatic speech recognition (ASR) systems while preserving human-perceived linguistic content. Neural audio codecs impose a discrete bottleneck that can suppress fine-grained signal variations associated with adversarial noise. We examine how the granularity of this bottleneck, controlled by residual vector quantization (RVQ) depth, shapes adversarial robustness. We observe a non-monotonic trade-off under gradient-based attacks: shallow quantization suppresses adversarial perturbations but degrades speech content, while deeper quantization preserves both content and perturbations. Intermediate depths balance these effects and minimize transcription error. We further show that adversarially induced changes in discrete codebook tokens strongly correlate with transcription error. These gains persist under adaptive attacks, where neural codec configurations outperform traditional compression defenses.
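To illustrate the mechanism the abstract describes, the sketch below shows how RVQ depth controls reconstruction granularity: each stage quantizes the residual left by the previous stages, so more stages reproduce finer signal detail (and, per the paper's argument, more of an adversarial perturbation). This is a minimal toy with random, untrained codebooks and made-up sizes, not the paper's codec; each codebook includes a zero codeword so that adding a stage can never increase reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_quantize(x, codebooks):
    """Residual vector quantization: stage k encodes the residual left by
    stages 1..k-1 with its nearest codeword; depth = number of stages."""
    residual = x.copy()
    recon = np.zeros_like(x)
    tokens = []
    for cb in codebooks:
        # squared distance from each residual vector to every codeword
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)  # nearest codeword per vector
        tokens.append(idx)
        recon = recon + cb[idx]
        residual = x - recon
    return recon, tokens

# Toy setup (hypothetical sizes): 8 stages, shrinking codeword scale, and a
# zero codeword in every stage so extra depth cannot hurt reconstruction.
dim, n_codes, max_depth = 16, 64, 8
codebooks = [
    np.vstack([np.zeros((1, dim)),
               rng.normal(size=(n_codes - 1, dim)) * 0.5 ** k])
    for k in range(max_depth)
]
x = rng.normal(size=(32, dim))

# Reconstruction error as a function of RVQ depth: it shrinks as depth grows,
# which is exactly why deeper quantization also preserves perturbations.
errors = [
    float(np.mean((x - rvq_quantize(x, codebooks[:d])[0]) ** 2))
    for d in range(1, max_depth + 1)
]
```

In a codec used as an ASR defense, the token lists returned per stage are the discrete codes whose adversarially induced changes the paper correlates with transcription error; the depth knob trades the residual error above against perturbation suppression.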