The paper introduces Forensic Answer-Questioning (FAQ), a large-scale multiple-choice benchmark designed to evaluate and enhance the temporal reasoning abilities of Vision-Language Models (VLMs) in detecting video deepfakes. FAQ assesses VLMs across three levels: facial perception, temporal deepfake grounding, and forensic reasoning, focusing on identifying dynamic inconsistencies rather than just static artifacts. Fine-tuning VLMs on the generated instruction-tuning set, FAQ-IT, significantly improves performance on both in-domain and cross-dataset deepfake detection benchmarks, demonstrating the effectiveness of FAQ in fostering temporal reasoning.
VLMs can now reason about temporal inconsistencies in video deepfakes, thanks to a new benchmark that moves beyond static artifact detection.
Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate VLMs and equip them with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve strong performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of our key design choices, confirming that FAQ is the driving force behind the temporal reasoning capabilities of these VLMs.
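To make the multiple-choice formulation concrete, the following is a minimal sketch of how answers might be scored separately for each of the three levels. The abstract does not specify FAQ's data format, so the `MCQItem` schema, the level identifiers, the `accuracy_by_level` helper, and the toy questions are all hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical identifiers for the three levels named in the abstract.
LEVELS = ("facial_perception", "temporal_grounding", "forensic_reasoning")


@dataclass(frozen=True)
class MCQItem:
    """One multiple-choice question (assumed schema, not the FAQ format)."""
    level: str      # one of LEVELS
    question: str
    options: tuple  # answer choices, labelled A, B, C, ... by position
    answer: str     # letter of the correct choice, e.g. "B"


def accuracy_by_level(items, predictions):
    """Return {level: fraction correct} for each level that has items."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.level] += 1
        # Compare letters case-insensitively, ignoring surrounding whitespace.
        if pred.strip().upper() == item.answer.upper():
            correct[item.level] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}


# Toy items invented for illustration; they are not drawn from the benchmark.
items = [
    MCQItem("facial_perception", "Which region shows blending artifacts?",
            ("Jawline", "Hair", "Background"), "A"),
    MCQItem("temporal_grounding", "In which segment does the mouth flicker?",
            ("0-1 s", "1-2 s", "2-3 s"), "B"),
    MCQItem("temporal_grounding", "Where is blinking inconsistent?",
            ("0-1 s", "1-2 s", "2-3 s"), "C"),
    MCQItem("forensic_reasoning", "Is the video authentic?",
            ("Fake", "Real"), "A"),
]
predictions = ["A", "b", "A", "A"]  # hypothetical model outputs

scores = accuracy_by_level(items, predictions)
print(scores)
# {'facial_perception': 1.0, 'temporal_grounding': 0.5, 'forensic_reasoning': 1.0}
```

Reporting accuracy per level rather than as a single pooled number keeps static-artifact perception separate from temporal grounding, which is the distinction the benchmark is built around.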