The authors introduce Spoof-SUPERB, a benchmark for evaluating self-supervised learning (SSL) models on audio deepfake detection across diverse architectures (generative, discriminative, spectrogram-based). They systematically evaluated 20 SSL models on in-domain and out-of-domain datasets, finding that large-scale discriminative models like XLS-R, UniSpeech-SAT, and WavLM Large exhibit superior performance and robustness. The benchmark provides a reproducible baseline and insights into SSL representation reliability for securing speech systems against audio deepfakes.
The Spoof-SUPERB benchmark shows that large-scale discriminative SSL models detect audio deepfakes more accurately and more robustly than generative approaches, making them the more reliable choice for securing speech systems.
Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.
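The abstract does not say which metric is used to compare models. As an illustration only, the sketch below assumes equal error rate (EER), a common choice in spoofing detection, and shows how detector scores on a single dataset might be reduced to one number. The function name, the score convention (higher means more likely bona fide), and the toy scores are all hypothetical and are not taken from the benchmark.

```python
import numpy as np


def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a detector whose scores are higher for bona fide audio.

    Assumption for illustration: EER is the comparison metric. The source
    abstract does not name its metric.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct observed score, in ascending order.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide utterances scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed utterances scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


# Toy scores (synthetic, not benchmark results).
toy_bonafide = [0.9, 0.6, 0.4, 0.8]
toy_spoof = [0.1, 0.5, 0.7, 0.2]
eer, threshold = equal_error_rate(toy_bonafide, toy_spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

Under this assumption, running the same function on the scores from an in-domain test set and from an out-of-domain test set would give two numbers per model, and the gap between them is one simple way to read robustness.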