MBZUAI, Feb 23, 2026, arXiv:2602.19715

Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

Kartik Kuckreja, Parul Gupta, Muhammad Haris Khan, Abhinav Dhall

AI Summary

The paper introduces DeepfakeJudge, a framework for evaluating and improving the reasoning fidelity of deepfake detection models by using a bootstrapped generator-evaluator process with a meta-evaluation benchmark. This framework addresses the issue of deepfake detectors producing natural language explanations that are often not grounded in visual evidence. The proposed reasoning-bootstrapped model achieves 96.2% accuracy on the meta-evaluation benchmark and demonstrates high correlation with human ratings, indicating improved reasoning fidelity in deepfake detection.

Key Contribution

Deepfake detectors can be tricked into "reasoning" without visual grounding, but DeepfakeJudge offers a way to bootstrap MLLMs to generate faithful explanations and improve detection accuracy.

Abstract

Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that specialize in evaluating reasoning rationales without the need for explicit ground-truth rationales. The Judge is optimized through a bootstrapped generator-evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2%, outperforming baselines that are 30x larger. The reasoning judge attains very high correlation with human ratings and 98.9% pairwise agreement on the human-annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. Our user study shows that participants preferred the reasoning generated by our framework 70% of the time, in terms of faithfulness, groundedness, and usefulness, compared to that produced by other models and datasets. All of our datasets, models, and codebase are open-sourced at https://github.com/KjAeRsTuIsK/DeepfakeJudge.
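
The abstract reports two kinds of agreement between the judge and human annotators: correlation on pointwise ratings and agreement on pairwise preferences. The sketch below illustrates how such metrics could be computed on toy data. It is not the authors' evaluation code: the function names, the toy values, and the choice of Pearson correlation are all assumptions made for illustration, since the abstract does not say which correlation measure is used.

```python
from statistics import mean


def pairwise_agreement(judge_prefs, human_prefs):
    """Fraction of comparison pairs where judge and human pick the same side."""
    if len(judge_prefs) != len(human_prefs) or not judge_prefs:
        raise ValueError("preference lists must be non-empty and equal length")
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return matches / len(judge_prefs)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("score lists must have equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: "A"/"B" marks which of two rationales was preferred.
judge_prefs = ["A", "B", "A", "A"]
human_prefs = ["A", "B", "B", "A"]

# Hypothetical pointwise ratings of the same rationales (e.g. on a 1-5 scale).
judge_scores = [1, 2, 3, 4]
human_scores = [2, 4, 6, 8]

agreement = pairwise_agreement(judge_prefs, human_prefs)
correlation = pearson(judge_scores, human_scores)
```

In this toy setup the judge matches the human on three of four pairs, and the pointwise scores are perfectly linearly related. A rank-based measure such as Spearman correlation would be an equally plausible choice for ordinal ratings.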

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Year: 2026

Venue: N/A
