B-it, CUHK, Duke, Fortemedia Singapore, JHU, KAIST, Melbourne, UMD
Jun 9, 2026
arXiv:2606.10791

Overview of ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge

Xueping Zhang, Han Yin, Yang Xiao, Ting Dang, Rohan Kumar Das, Ming Li

AI Summary

The Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) evaluated systems for component-level audio spoofing detection, where speech and environmental sounds may be manipulated independently or jointly. The top-performing system achieved a Macro-F1 score of 0.8775 on the test set, compared with 0.6327 for the baseline. Top submissions benefited from modular task decomposition and cross-domain self-supervised encoders rather than simple model scaling. Detecting spoofed environmental sounds and generalizing to unseen generators remain difficult and are open areas for future research.

Key Contribution

The best system in the ESDD2 challenge achieved a Macro-F1 score of 0.8775, and the top submissions point to modular task decomposition and self-supervised encoders as effective design choices for audio deepfake detection.

Abstract

The Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), held in conjunction with ICME 2026, evaluated systems for five-class component-level audio spoofing detection, where speech and environmental sounds may be manipulated independently or jointly. Following the conclusion of the challenge, we analyze the final leaderboard and summarize effective design choices from the top-performing submissions. The challenge attracted 94 registrations from 16 countries; after verification of submission requirements and metadata, 13 teams were retained for the final analysis. On the test set, the best system achieved a Macro-F1 score of 0.8775, substantially outperforming the separation-enhanced joint learning baseline (0.6327). Top systems consistently benefited from modular task decomposition, cross-domain self-supervised encoders, targeted data augmentation, and selective ensembling rather than simple model scaling. At the same time, auxiliary EER analyses reveal persistent difficulty in detecting the spoofed environmental component and in generalizing to unseen generators in the test set. This paper reports the challenge results and provides insights for future environment-aware deepfake detection research. The CompSpoofV2 dataset and baseline code remain publicly available for reproducibility.
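
For readers unfamiliar with the headline metric, the following is a minimal sketch of how a Macro-F1 score is generally computed: the F1 score is calculated per class and then averaged with equal weight, so rare classes count as much as common ones. The function name `macro_f1`, the integer class labels, and the toy predictions are illustrative assumptions, not the challenge's official scoring script, which may differ in detail.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Return the unweighted mean of per-class F1 scores.

    Hypothetical helper for illustration only; not the official ESDD2 scorer.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # class p was predicted but wrong
            fn[t] += 1          # class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy three-class example (labels are arbitrary placeholders).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, labels=[0, 1, 2])
print(round(score, 4))
```

Because every class contributes equally to the average, a system that handles the common classes well but fails on one harder class is penalized noticeably, which is one reason this kind of metric suits multi-class component-level detection.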

Multimodal Models Speech & Audio

