This paper investigates gender bias in audio deepfake detection using the ASVspoof 5 dataset and a ResNet-18 classifier trained on several audio features. The study compares the ResNet-18 model against the AASIST baseline, evaluating both detection accuracy and fairness across genders. The key finding is that even when the overall Equal Error Rate (EER) difference between genders appears small, fairness metrics reveal substantial disparities in error distribution, highlighting the limitations of relying solely on aggregate performance measures.
Even when overall accuracy seems balanced, audio deepfake detection models can exhibit significant gender bias, masked by standard metrics like EER.
Audio deepfake detection aims to distinguish real human voices from those generated by Artificial Intelligence (AI) and has emerged as a significant problem for voice biometric systems. With the ever-improving quality of synthetic speech, the likelihood of such voices being exploited for illicit practices such as identity theft and impersonation increases. Although significant progress has been made in audio deepfake detection in recent years, the issue of gender bias remains underexplored and in its nascent stage. In this paper, we present a thorough analysis of gender-dependent performance and fairness in audio deepfake detection models. We use the ASVspoof 5 dataset to train a ResNet-18 classifier, evaluate detection performance across four different audio features, and compare the results against the AASIST baseline model. Beyond conventional metrics such as Equal Error Rate (EER %), we incorporate five established fairness metrics to quantify gender disparities in the model. Our results show that even when the overall EER difference between genders appears low, fairness-aware evaluation reveals disparities in error distribution that are obscured by aggregate performance measures. These findings demonstrate that reliance on standard metrics alone is unreliable, whereas fairness metrics provide critical insight into demographic-specific failure modes. This work highlights the importance of fairness-aware evaluation for developing more equitable, robust, and trustworthy audio deepfake detection systems.
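To make the core point concrete, the sketch below (not the paper's code) shows one way to compute per-gender EER and per-gender error gaps at a shared decision threshold with NumPy. The function names, the "1 = bona fide" label convention, and the two-group assumption are illustrative choices, and the five fairness metrics used in the paper are not reproduced here; this simply illustrates how an aggregate EER can look balanced while group-wise false-acceptance and false-rejection rates diverge.

```python
import numpy as np

def eer(scores, labels):
    """Equal Error Rate: operating point where false-acceptance and
    false-rejection rates are (approximately) equal.

    scores: higher = more likely bona fide; labels: 1 = bona fide, 0 = spoof.
    Assumes both classes are present.
    """
    thresholds = np.sort(np.unique(scores))
    far = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])  # spoof accepted
    frr = np.array([np.mean(scores[labels == 1] < t) for t in thresholds])   # bona fide rejected
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

def gender_fairness_report(scores, labels, genders, threshold):
    """Per-group EER plus false-positive/false-negative rates at one shared
    threshold, and the absolute gaps between the two groups."""
    report = {}
    for g in np.unique(genders):
        m = genders == g
        fpr = np.mean(scores[m & (labels == 0)] >= threshold)  # spoof passed as real
        fnr = np.mean(scores[m & (labels == 1)] < threshold)   # real flagged as fake
        report[g] = {"EER": eer(scores[m], labels[m]), "FPR": fpr, "FNR": fnr}
    g0, g1 = list(report)  # illustrative: assumes exactly two gender groups
    gaps = {k: abs(report[g0][k] - report[g1][k]) for k in ("EER", "FPR", "FNR")}
    return report, gaps

if __name__ == "__main__":
    # Tiny synthetic example, not real ASVspoof 5 scores.
    rng = np.random.default_rng(0)
    n = 2000
    genders = rng.choice(["F", "M"], size=n)
    labels = rng.integers(0, 2, size=n)
    scores = rng.normal(loc=labels * 2.0, scale=1.0)
    scores[(genders == "F") & (labels == 1)] -= 0.8  # inject a group-specific weakness
    report, gaps = gender_fairness_report(scores, labels, genders, threshold=1.0)
    print(report)
    print("gaps:", gaps)
```

Even in this toy setup, the pooled EER can stay moderate while the FNR gap between groups is large, which is the kind of disparity that the fairness metrics in the paper are meant to surface.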