Tsinghua AICAS, George Mason University, NTU, Tencent AI, USTC
May 21, 2026
arXiv:2605.22072

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye

AI Summary

Faithful-MR1 addresses the challenge of faithful multimodal reasoning in MLLMs by explicitly anchoring visual attention to image regions and reinforcing the use of relevant visual evidence during reasoning. It introduces an Anchoring stage that supervises a dedicated <Focus> token's attention against image regions, and a Reinforcing stage that uses counterfactual image intervention to reward answer-correct trajectories that concentrate visual attention where it causally matters. Experiments show that Faithful-MR1 outperforms recent multimodal reasoning baselines on Qwen2.5-VL-Instruct 3B and 7B backbones while using less training data.

Key Contribution

MLLMs can learn to reason more faithfully by explicitly anchoring visual attention to relevant image regions and reinforcing the use of that evidence during reasoning via counterfactual interventions.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated <Focus> token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.
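
The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no loss or reward formulas, so the function names, the cross-entropy form of the anchoring loss, the log-probability gap used as a causal signal, and the `bonus` weight are all assumptions made here for illustration only.

```python
import math


def anchoring_loss(focus_attention, region_mask, eps=1e-8):
    """Hypothetical anchoring objective (assumed form, not from the paper).

    focus_attention: non-negative attention weights of the <Focus> token
                     over image patches.
    region_mask:     1 for patches inside the annotated evidence region,
                     0 otherwise.
    Returns the mean negative log of the normalized attention on region
    patches, which shrinks as attention concentrates on the region.
    """
    total = sum(focus_attention)
    n_region = sum(region_mask)
    if total <= 0 or n_region == 0:
        raise ValueError("need positive attention mass and a non-empty region")
    normalized = [a / total for a in focus_attention]
    log_terms = [math.log(a + eps) for a, m in zip(normalized, region_mask) if m]
    return -sum(log_terms) / n_region


def reinforcing_reward(is_correct, logp_original, logp_counterfactual,
                       visual_attention_mass, bonus=0.5):
    """Hypothetical reward shaping (assumed form, not from the paper).

    logp_original:       answer log-probability given the real image.
    logp_counterfactual: answer log-probability given an intervened image.
    A positive gap is treated as evidence that vision causally matters,
    and only then is attention on visual tokens given a bonus.
    """
    if not is_correct:
        return 0.0
    causal_gap = logp_original - logp_counterfactual
    if causal_gap <= 0:
        return 1.0  # correct, but the image did not appear to matter
    return 1.0 + bonus * visual_attention_mass


if __name__ == "__main__":
    mask = [1, 1, 0, 0]
    print(anchoring_loss([0.25, 0.25, 0.25, 0.25], mask))
    print(anchoring_loss([0.5, 0.5, 0.0, 0.0], mask))
    print(reinforcing_reward(True, -0.1, -2.0, 0.8))
```

In this sketch the anchoring loss is lower when the <Focus> token's attention falls inside the region mask, and the reward bonus applies only to correct answers whose likelihood drops under the counterfactual image, which mirrors the abstract's phrase "where vision causally matters" under the stated assumptions.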

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
