May 6, 2026

arXiv:2605.04797

Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes

Michael Soprano, A. Cioci, Stefano Mizzaro

AI Summary

This paper investigates the reliability of crowdsourced human judgments for detecting audiovisual deepfakes using the AV-Deepfake1M and Trusted Media Challenge datasets. The study reveals that while crowd workers rarely misclassify authentic videos, they frequently miss manipulations, especially in joint audio-video deepfakes, and exhibit limited agreement across videos. Aggregating multiple judgments improves authenticity detection but fails to recover consistently missed manipulations, highlighting challenges in reliable modality attribution.

Key Contribution

Crowd workers rarely mislabel authentic videos but miss many manipulations, especially when both audio and video are manipulated. Crowdsourcing can therefore serve as a scalable screening signal for authenticity, while reliable identification of the manipulated modality remains an open challenge.

Abstract

Deepfakes are increasingly realistic and easy to produce, raising concerns about the reliability of human judgments in misinformation settings. We study audiovisual deepfake detection by measuring how consistently crowd workers distinguish authentic from manipulated videos and, when they flag a video as manipulated, how accurately they identify the manipulation type (audio-only, video-only, or audio-video) and how consistently they report manipulation timestamps. We run two matched crowdsourcing studies on Prolific using AV-Deepfake1M and the Trusted Media Challenge (TMC) dataset. We sample 48 videos per dataset (96 total) and collect 960 judgments (10 per video). Results show that crowd workers rarely misclassify authentic videos as manipulated, but they miss many manipulations, and agreement remains limited across videos. Aggregating multiple judgments per video stabilizes the authenticity signal, but it cannot recover manipulations that most workers consistently miss. Manipulation type identification is substantially noisier than authenticity detection even when workers detect a manipulation, with joint audio-video cases being particularly hard to recognize. Overall, these findings suggest that crowdsourcing can provide a scalable screening signal for audiovisual authenticity, while reliable modality attribution remains an open challenge.
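The aggregation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which aggregation rule the authors use, so simple majority voting is an assumption here, as are the function names, the tie-breaking rule, and the toy judgments, all of which are hypothetical and not taken from the paper's data.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label.

    Ties resolve to "authentic"; this tie-breaking rule is an
    assumption made for illustration, not taken from the paper.
    """
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else "authentic"

def aggregate(judgments):
    """Map each video id to its majority-vote label."""
    return {video: majority_vote(labels) for video, labels in judgments.items()}

# Hypothetical toy data: 10 judgments per video, as in the study design.
judgments = {
    # Authentic video: one worker wrongly flags it; the vote absorbs the error.
    "video_a": ["authentic"] * 9 + ["manipulated"],
    # Manipulated video that most workers miss: the vote cannot recover it.
    "video_b": ["authentic"] * 7 + ["manipulated"] * 3,
    # Manipulated video that most workers catch.
    "video_c": ["manipulated"] * 8 + ["authentic"] * 2,
}

aggregated = aggregate(judgments)
for video, label in aggregated.items():
    print(video, label)
```

Under these assumptions, the second video shows the limitation the abstract reports: when most workers miss a manipulation, aggregation stabilizes the wrong answer rather than correcting it.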

Computer Vision, Constitutional AI & AI Ethics, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
