Apr 30, 2026

arXiv:2604.28022

Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Sharayu Nilesh Deshmukh, Kailash A. Hambarde, Joana C. Costa, Hugo Proença, Tiago Roxo

AI Summary

This paper introduces a new evaluation setup for DeepFake detection that tests whether models assess semantic consistency between audio and video, rather than relying solely on data source integrity. The authors extend the standard four-class DeepFake formulation with a fifth class, "Real Audio-Real Video with Semantic Mismatch" (RARV-SMM). Experiments on the FakeAVCeleb and LAV-DF datasets show that state-of-the-art models struggle with RARV-SMM, and that a semantic reinforcement strategy using ImageBind embeddings improves detection performance in both the proposed and the standard settings.
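To make the idea of an embedding-based semantic check concrete, the following is a minimal sketch of one common way to score audio-visual agreement: cosine similarity between an audio embedding and a video embedding that live in a shared space. It is not taken from the paper and is not claimed to be the authors' reinforcement strategy. The function names, the `threshold` value, and the assumption that the embeddings were computed beforehand (for example by a joint-embedding model such as ImageBind) are all illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def flag_semantic_mismatch(audio_emb, video_emb, threshold=0.5):
    """Return True when the two embeddings agree less than `threshold`.

    `threshold` is a hypothetical value chosen for illustration; in
    practice it would have to be tuned on validation data.
    """
    return cosine_similarity(audio_emb, video_emb) < threshold
```

A score like this could serve as an extra input feature or as a sanity check next to a classifier. How the paper actually combines ImageBind embeddings with the detector is not specified in the text above.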

Key Contribution

Current DeepFake detectors can be fooled by semantically inconsistent real audio and video, highlighting a critical blind spot in their ability to assess realistic manipulations.

Abstract

Current DeepFake detection scenarios are mostly binary, yet manipulation can affect audio, video, or both, and binary settings do not capture this variability. Four-class audio-visual formulations address this by discriminating manipulation type, but introduce an unresolved problem: models may rely solely on data source integrity to detect DeepFakes without evaluating their semantic consistency. If the DeepFake origin is not in the data source but in its content, can semantic mismatch be assessed by the state-of-the-art? This paper proposes a new evaluation setup, extending the four-class formulation by explicitly modeling semantic-level inconsistency between authentic modalities with the introduction of a new class: Real Audio-Real Video with Semantic Mismatch (RARV-SMM). We assess the robustness of state-of-the-art models in this new realistic DeepFake setting, using the FakeAVCeleb dataset, highlighting the limitations of existing approaches when faced with semantically mismatched data. We further introduce three RARV-SMM variants that expose distinct architectural vulnerabilities as audio-visual divergence increases. We also propose a semantic reinforcement strategy that incorporates the semantic mismatch class and ImageBind embeddings to improve DeepFake detection in both our proposed and state-of-the-art settings, on FakeAVCeleb and LAV-DF, paving the way to more realistic DeepFake detectors. The source code and data are available at https://github.com/.
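The five-class setup described in the abstract can be illustrated with a small sketch. The abstract does not say how the authors build RARV-SMM samples or its three variants, so the code below assumes the simplest possible construction: each authentic video is paired with authentic audio taken from a different clip. The class abbreviations other than RARV-SMM, the `Clip` and `Sample` types, and the `make_mismatch_samples` helper are hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass
from enum import IntEnum


class AVClass(IntEnum):
    """Four standard audio-visual classes plus the mismatch class."""
    RARV = 0      # real audio, real video, semantically matched
    RAFV = 1      # real audio, fake video
    FARV = 2      # fake audio, real video
    FAFV = 3      # fake audio, fake video
    RARV_SMM = 4  # real audio, real video, semantic mismatch


@dataclass(frozen=True)
class Clip:
    clip_id: str
    audio_path: str
    video_path: str


@dataclass(frozen=True)
class Sample:
    audio_path: str
    video_path: str
    label: AVClass


def make_mismatch_samples(real_clips, seed=0):
    """Pair every real video with real audio from a *different* real clip.

    Clips are shuffled, then each video takes the audio of the next clip
    in the shuffled order (wrapping around), so no clip keeps its own audio.
    """
    if len(real_clips) < 2:
        raise ValueError("need at least two real clips to build a mismatch")
    rng = random.Random(seed)
    order = list(real_clips)
    rng.shuffle(order)
    samples = []
    for pos, video_clip in enumerate(order):
        audio_donor = order[(pos + 1) % len(order)]
        samples.append(
            Sample(audio_donor.audio_path, video_clip.video_path, AVClass.RARV_SMM)
        )
    return samples
```

Because both modalities in such a sample are authentic, a detector that only checks whether each stream was synthesized has no signal to separate it from the matched real class, which is the blind spot the abstract describes.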

Computer Vision, Red-Teaming & Adversarial Robustness, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
