Search papers, labs, and topics across Lattice.
This paper introduces Semantic Consistent Evidence Pack (SCEP), a training-free framework for adapting large vision-language models (LVLMs) to image deepfake detection. SCEP identifies a compact set of suspicious image patches by clustering patch features based on semantic mismatch with the CLS token and frequency/noise anomalies. These patches are then used as evidence to condition a frozen LVLM for deepfake classification, achieving strong performance without fine-tuning.
Forget fine-tuning: this method uses smart patch selection to adapt frozen LVLMs for deepfake detection, outperforming baselines without any training.
Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder's CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency-and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.