Apr 9, 2026arXiv:2604.08450

DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection

Yassine El Kheir, Y. E. Kheir, Arnab Das, Arnab Das, Yixuan Xiao, Yixuan Xiao, Xin Wang, Xin Wang, Feidi Kallel, Feidi Kallel, Enes Erdem Erdogan, Enes Erdem Erdogan, Ngoc Thang Vu, Ngoc Thang Vu, Tim Polzehl, Tim Polzehl, Sebastian Moeller, Sebastian Moeller

AI Summary

The paper introduces DeepFense, an open-source PyTorch toolkit for deepfake audio detection, integrating various architectures, loss functions, and augmentation pipelines with over 100 recipes. Through a large-scale evaluation of over 400 models within DeepFense, the authors found that the choice of pre-trained front-end feature extractor is the primary driver of performance. The study also reveals significant biases in high-performing models related to audio quality, speaker gender, and language, highlighting the need for equitable training data selection and front-end fine-tuning.

Key Contribution

The best deepfake audio detectors are surprisingly biased by audio quality, speaker gender, and language, undermining their real-world reliability.

Abstract

Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real-world deployment with the necessary tools to address equitable training data selection and front-end fine-tuning.

Eval Frameworks & Benchmarks Open-Source Models & Weights Speech & Audio

Citation Metrics

Citations0

Influential citations0

References49

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection

Related Papers