The paper introduces ForgeryVCR, a visual-centric reasoning framework for image forgery detection and localization that addresses the difficulty text-centric multimodal large language models (MLLMs) have in capturing fine-grained tampering traces. ForgeryVCR incorporates a forensic toolbox that transforms imperceptible traces into explicit visual intermediates the model can reason over. To optimize the MLLM's tool usage, the authors employ a Strategic Tool Learning post-training paradigm: gain-driven trajectory construction for supervised fine-tuning (SFT), followed by reinforcement learning (RL) guided by a tool utility reward.
MLLMs can now spot subtle image forgeries with SOTA accuracy by strategically using forensic tools to expose hidden inconsistencies, outperforming traditional text-centric approaches.
Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as linguistic modalities are insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and frequency domains. Extensive experiments reveal that ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at https://youqiwong.github.io/projects/ForgeryVCR/.
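As a rough illustration of what "materializing imperceptible traces into explicit visual intermediates" can look like, the sketch below computes a simple noise residual and a local zoom-in crop on a grayscale image. It is not the ForgeryVCR toolbox, whose actual tools are not specified in the abstract. The function names (`noise_residual`, `zoom_in`, `strongest_residual`), the 3x3 mean filter, and the list-of-lists image format are all assumptions chosen for this example.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities (assumed format)


def noise_residual(img: Image) -> Image:
    """Subtract a 3x3 local mean from each pixel (a simple high-pass filter)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp the window at the image borders instead of padding.
            ys = range(max(0, y - 1), min(h, y + 2))
            xs = range(max(0, x - 1), min(w, x + 2))
            window = [img[j][i] for j in ys for i in xs]
            out[y][x] = img[y][x] - sum(window) / len(window)
    return out


def zoom_in(img: Image, box: Tuple[int, int, int, int]) -> Image:
    """Crop (top, left, bottom, right), end-exclusive, for local inspection."""
    top, left, bottom, right = box
    return [row[left:right] for row in img[top:bottom]]


def strongest_residual(img: Image) -> Tuple[int, int]:
    """Return (row, col) of the pixel with the largest absolute residual."""
    res = noise_residual(img)
    h, w = len(res), len(res[0])
    return max(
        ((y, x) for y in range(h) for x in range(w)),
        key=lambda p: abs(res[p[0]][p[1]]),
    )


# Toy example: a flat 5x5 image with one altered pixel standing in for a tampered region.
flat = [[10.0] * 5 for _ in range(5)]
tampered = [row[:] for row in flat]
tampered[2][3] = 50.0

hotspot = strongest_residual(tampered)      # location the residual map highlights
patch = zoom_in(tampered, (1, 2, 4, 5))     # 3x3 crop around that location
```

In this toy case the residual map is zero everywhere on the untouched image and peaks at the altered pixel, which is the kind of explicit visual cue a model could inspect instead of describing the trace in text. Real forensic tools for compression history or frequency-domain analysis are considerably more involved.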