This paper investigates how speech enhancement (SE) affects audio deepfake detection in noisy conditions. The authors corrupt the test set of the ASVspoof 2019 LA dataset with noise at varying SNR levels and then apply SE. Two SE algorithms, SEGAN and MetricGAN+, are compared on speech quality (PESQ and SRMR) and on their effect on the Equal Error Rate (EER) of a spoofing detection system. Counterintuitively, SEGAN, which yielded the lower speech quality scores, gave better spoofing detection performance (lower EER) than MetricGAN+, which achieved the higher speech quality scores.
Higher perceptual speech quality does not guarantee better audio deepfake detection: the enhancement algorithm with the *highest* quality scores (MetricGAN+) produced a *higher* EER than the lower-scoring SEGAN.
Logical Access (LA) attacks, also known as audio deepfake attacks, use Text-to-Speech (TTS) or Voice Conversion (VC) methods to generate spoofed speech data. They pose a serious threat to Automatic Speaker Verification (ASV) systems, as intruders can use such attacks to bypass voice biometric security. In this study, we investigate the correlation between speech quality and the performance of audio spoofing detection systems (i.e., the LA task). To that end, two enhancement algorithms are evaluated on two perceptual speech quality measures, namely Perceptual Evaluation of Speech Quality (PESQ) and Speech-to-Reverberation Modulation Ratio (SRMR), and with respect to their impact on the audio spoofing detection system. We adopted the LA dataset provided in the ASVspoof 2019 Challenge and corrupted its test set with noise at different Signal-to-Noise Ratio (SNR) levels, while leaving the training data untouched. Enhancement was applied to attenuate the detrimental effects of noisy speech, and the performances of two models, Speech Enhancement Generative Adversarial Network (SEGAN) and Metric-Optimized Generative Adversarial Network Plus (MetricGAN+), were compared. Although we expect speech quality to correlate well with the performance of speech applications, enhancement can also have side effects on downstream tasks if unwanted artifacts are introduced or relevant information is removed from the speech signal. Our results corroborate this hypothesis: the enhancement algorithm leading to the highest speech quality scores, MetricGAN+, provided the highest Equal Error Rate (EER) on the audio spoofing detection task, whereas the enhancement method with the lowest speech quality scores, SEGAN, led to the lowest EER and thus to better performance on the LA task.
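To make the two quantities in the abstract concrete, the following is a minimal sketch of (a) mixing noise into a clean signal at a target SNR and (b) computing an EER from detector scores. It is not the authors' pipeline: the function names `mix_at_snr` and `compute_eer` are hypothetical, the noise here is synthetic Gaussian noise rather than whatever noise sources the study used, and the EER routine assumes that higher scores mean "more likely bona fide" and only checks thresholds at observed scores.

```python
import math
import random
from typing import List, Sequence


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Repeat the noise cyclically so it covers the whole speech signal.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    # Pick the gain so that speech_power / (gain**2 * noise_power) == 10 ** (snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def compute_eer(bonafide_scores: Sequence[float], spoof_scores: Sequence[float]) -> float:
    """Approximate Equal Error Rate (assumption: higher score = more likely bona fide)."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is taken where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Illustrative usage with synthetic data (a 220 Hz tone plus Gaussian noise).
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(700)]
noisy_10db = mix_at_snr(clean, noise, snr_db=10.0)
```

In an experiment of the kind described above, the noisy signal would then be passed through an enhancement model before scoring, and the EER would be computed separately for each SNR level and each enhancement method so the two can be compared.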