FudanFeb 15, 2026arXiv:2602.14098

ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization

Youqi Wang, Shen Chen, Haowei Wang, Rongxuan Peng, Taiping Yao, Shunquan Tan, Changsheng Chen, Bin Li, Shouhong Ding

AI Summary

The paper introduces ForgeryVCR, a visual-centric reasoning framework for image forgery detection and localization that addresses the limitations of text-centric MLLMs in capturing fine-grained tampering traces. ForgeryVCR incorporates a forensic toolbox to transform imperceptible traces into explicit visual intermediates, enabling more effective analysis. The authors employ a Strategic Tool Learning post-training paradigm, using gain-driven trajectory construction via SFT and RL with a tool utility reward, to optimize the MLLM's tool usage.

Key Contribution

MLLMs can now spot subtle image forgeries with SOTA accuracy by strategically using forensic tools to expose hidden inconsistencies, outperforming traditional text-centric approaches.

Abstract

Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as linguistic modalities are insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and frequency domains. Extensive experiments reveal that ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at https://youqiwong.github.io/projects/ForgeryVCR/.

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization

Related Papers