May 27, 2026 | arXiv:2605.28459

REVEAL: Reference-Grounded Reasoning for Multimodal Manipulation Detection

Jun Zhou, Bingwen Hu, Yaxiong Wang, Zhedong Zheng, Yongzhen Wang, Ping Liu

AI Summary

The paper introduces REVEAL, a reference-grounded framework for multimodal manipulation detection that compares query image-text pairs against a large library of authentic examples. REVEAL uses a difference-aware fusion mechanism to identify subtle discrepancies and a task-decoupled Mixture-of-Experts architecture for simultaneous detection and localization of manipulated regions. Experiments show REVEAL significantly outperforms existing methods and enables training-free domain adaptation by updating the reference library, demonstrating robustness against evolving misinformation.

Key Contribution

Forget memorizing manipulation artifacts: comparing against a reference library of authentic examples lets you spot forgeries and adapt to new domains without retraining.
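To make the "update the library, not the model" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the `ReferenceLibrary` class, the toy three-dimensional embeddings, and the similarity threshold are all assumptions made for illustration, and the threshold rule merely stands in for REVEAL's learned comparison model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class ReferenceLibrary:
    """Hypothetical store of embeddings for authentic image-text pairs."""

    def __init__(self):
        self.entries = []  # list of (label, embedding)

    def add(self, label, embedding):
        # Adding evidence needs no gradient step: the "model" is untouched.
        self.entries.append((label, list(embedding)))

    def retrieve(self, query, k=1):
        # Rank all references by similarity to the query, best first.
        scored = [(cosine(query, emb), label) for label, emb in self.entries]
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:k]

def is_suspicious(library, query, threshold=0.8):
    """Toy decision rule (an assumption): flag the query when no authentic
    reference is close enough to support it."""
    top = library.retrieve(query, k=1)
    return not top or top[0][0] < threshold

library = ReferenceLibrary()
library.add("news_domain_ref", [1.0, 0.0, 0.0])

# A query from a domain the library does not cover yet.
query_new_domain = [0.0, 1.0, 0.0]
before = is_suspicious(library, query_new_domain)

# "Training-free adaptation" in this sketch: add authentic evidence only.
library.add("new_domain_ref", [0.0, 0.9, 0.1])
after = is_suspicious(library, query_new_domain)

print(before, after)
```

In this toy setting the verdict for the new-domain query changes solely because authentic evidence was added to the library; the scoring code is identical before and after.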

Abstract

Multimodal manipulation detection aims to simultaneously identify forged image-text pairs and localize tampered regions, yet existing methods typically rely on memorizing isolated artifacts and struggle with imperceptible manipulation traces or domain shifts. Inspired by human comparative reasoning, we reformulate this task as a reference-grounded verification problem, where authenticity is assessed by comparing a query against retrieved authentic evidence. We propose REVEAL (Reference-Enabled Verification for Evidence Analysis and Localization), a framework explicitly designed for this comparative paradigm. To support this paradigm, we construct a large-scale reference library comprising 170K authentic news image-text pairs featuring over 40K public figures. Technically, REVEAL employs a difference-aware fusion mechanism to capture fine-grained discrepancies between the query and retrieved evidence. Furthermore, we introduce a task-decoupled Mixture-of-Experts (MoE) architecture to jointly execute instance-level detection and fine-grained grounding, effectively mitigating optimization conflicts between these heterogeneous objectives. Extensive experiments demonstrate that REVEAL significantly outperforms state-of-the-art methods, and notably enables training-free domain adaptation by simply updating the reference library, offering a robust and practical solution for detecting evolving misinformation. Code is available at https://anonymous.4open.science/r/REVEAL-Reference-A006.
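The abstract names two components without giving formulas: difference-aware fusion and a task-decoupled MoE. The sketch below illustrates one common way such ideas are realized, under stated assumptions. The feature layout `[q, r, q - r, q * r]`, the softmax gating, and every function name are hypothetical choices for illustration; the paper's actual design may differ.

```python
import math

def difference_features(query, reference):
    """Assumed fusion layout: concatenate the query, the reference, their
    element-wise difference, and their element-wise product."""
    diff = [q - r for q, r in zip(query, reference)]
    prod = [q * r for q, r in zip(query, reference)]
    return list(query) + list(reference) + diff + prod

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(expert_outputs, gate_scores):
    """Weighted sum of expert outputs using softmax gate weights."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]

# Fuse a toy query embedding with its retrieved reference.
fused = difference_features([1.0, 2.0], [1.0, 0.5])

# Two shared experts (outputs hard-coded here for clarity).
expert_outputs = [[1.0, 0.0], [0.0, 1.0]]

# "Task-decoupled" in this sketch: each task owns its gate scores, so
# detection and grounding can weight the same experts differently.
detection_repr = mix_experts(expert_outputs, gate_scores=[0.0, 0.0])
grounding_repr = mix_experts(expert_outputs, gate_scores=[2.0, 0.0])

print(fused)
print(detection_repr, grounding_repr)
```

Giving each task its own gate is one plausible reading of how decoupling could reduce interference between detection and localization objectives: the tasks share experts but not routing decisions.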

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
