May 5, 2026 · arXiv:2605.03420

Deepfake Audio Detection Using Self-supervised Fusion Representations

Khalid Zaman, Qixuan Huang, Muhammad Uzair, Masashi Unoki

AI Summary

This paper introduces a dual-branch deepfake detection framework for component-level detection on the CompSpoofV2 dataset. Pre-trained XLS-R and BEATs models extract speech and environmental sound representations, respectively. A Matching Head models the differences between the two representations, while multi-head cross-attention exchanges information between the branches. On the test set, the system achieves an F1-score of 70.20% and an environmental EER of 16.54%, outperforming the baseline system.
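For readers unfamiliar with the equal error rate (EER) quoted in the summary, the following is a minimal sketch of how such a figure is commonly approximated from detection scores. The function name `equal_error_rate`, the "higher score means more bona fide" convention, and the toy scores are assumptions for illustration; the paper's own evaluation script is not shown in the source and may differ (for example, by interpolating between thresholds).

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumed convention: a higher score means "more likely bona fide".
    Returns the average of the false-acceptance and false-rejection rates
    at the threshold where the two are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide samples scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed samples scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up): one bona fide and one spoofed sample overlap.
toy_bonafide = [0.9, 0.8, 0.7, 0.4]
toy_spoof = [0.1, 0.2, 0.3, 0.6]
toy_eer = equal_error_rate(toy_bonafide, toy_spoof)
```

A lower EER is better: a system that separates the two classes perfectly reaches 0, while chance-level scoring sits near 50%.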

Key Contribution

Fusing speech (XLS-R) and environmental sound (BEATs) representations through a Matching Head and multi-head cross-attention improves component-level deepfake audio detection over the baseline system, reaching an F1-score of 70.20% and an environmental EER of 16.54% on the test set.

Abstract

This paper describes a submission to the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) 2026, which addresses component-level deepfake detection using the CompSpoofV2 dataset, where speech and environmental sounds may be independently manipulated. To address this challenge, a dual-branch deepfake detection framework is proposed to jointly model speech and environmental contextual representations from input audio. Two pretrained models, XLS-R for speech and BEATs for environmental sound, are used to extract complementary contextual representations. A Matching Head is introduced to model representation differences through statistical normalization and representation interaction, enabling estimation of the original class. In parallel, multi-head cross-attention enables effective information exchange between speech and environmental components. The refined representations are processed with residual connections and layer normalization, and passed to an AASIST classifier to predict speech-based and environment-based spoofing probabilities. The model outputs original, speech, and environment predictions. On the test set, the proposed system achieves an F1-score of 70.20% and an environmental EER of 16.54%, outperforming the baseline system.
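The fusion steps described in the abstract can be illustrated with a small NumPy sketch. Everything below is an assumption made for illustration, not the authors' implementation: random arrays stand in for XLS-R and BEATs frame features, the feature size of 16 and the 4 heads are toy values, the projection weights are random rather than learned, and `matching_features` is one hypothetical reading of "statistical normalization and representation interaction". The AASIST classifier is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned scale)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context, num_heads, w_q, w_k, w_v, w_o):
    """Multi-head cross-attention: `query` frames attend to `context` frames."""
    t_q, d = query.shape
    d_head = d // num_heads
    # Project, then split channels into heads: (heads, time, d_head).
    q = (query @ w_q).reshape(t_q, num_heads, d_head).transpose(1, 0, 2)
    k = (context @ w_k).reshape(-1, num_heads, d_head).transpose(1, 0, 2)
    v = (context @ w_v).reshape(-1, num_heads, d_head).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    merged = (attn @ v).transpose(1, 0, 2).reshape(t_q, d)
    return merged @ w_o


def matching_features(speech, env):
    """Hypothetical matching-head input: normalized utterance-level means
    plus their absolute difference and element-wise product."""
    s = layer_norm(speech.mean(axis=0))
    e = layer_norm(env.mean(axis=0))
    return np.concatenate([s, e, np.abs(s - e), s * e])


d, heads = 16, 4                         # toy sizes, not the real model dimensions
speech = rng.standard_normal((50, d))    # stand-in for XLS-R frame features
env = rng.standard_normal((30, d))       # stand-in for BEATs frame features


def rand_proj():
    return rng.standard_normal((d, d)) / np.sqrt(d)


speech_w = [rand_proj() for _ in range(4)]   # w_q, w_k, w_v, w_o
env_w = [rand_proj() for _ in range(4)]

# Each branch attends to the other, followed by residual + layer normalization.
speech_refined = layer_norm(speech + cross_attention(speech, env, heads, *speech_w))
env_refined = layer_norm(env + cross_attention(env, speech, heads, *env_w))
match_vec = matching_features(speech, env)
```

In this sketch, `speech_refined` and `env_refined` would feed the speech-based and environment-based spoofing classifiers, while `match_vec` would feed a head estimating the original class. How the actual system wires these outputs is not specified beyond the abstract.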

Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 23

Year: 2026

Venue: N/A
