Faithful-MR1 addresses the challenge of faithful multimodal reasoning in MLLMs by explicitly anchoring visual attention to image regions and reinforcing the use of relevant visual evidence during reasoning. It introduces an Anchoring stage that supervises a dedicated <Focus> token's attention against image regions, and a Reinforcing stage that uses counterfactual image intervention to reward answer-correct trajectories that concentrate visual attention where it causally matters. Experiments show that Faithful-MR1 outperforms recent multimodal reasoning baselines on Qwen2.5-VL-Instruct 3B and 7B backbones while using less training data.
MLLMs can learn to reason more faithfully by explicitly anchoring visual attention to relevant image regions and reinforcing the use of that evidence during reasoning via counterfactual interventions.
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge with two halves: faithful perception of task-relevant visual evidence, and faithful use of that evidence during reasoning. Shortfalls in either lead to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, which exposes a perception-reasoning disconnect in which correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated <Focus> token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.
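To make the two stages more concrete, here is a minimal Python sketch of one way they could be expressed. The abstract does not give the actual loss or reward formulas, so everything here is an assumption made for illustration: the function names (`anchoring_loss`, `counterfactual_importance`, `reinforcing_reward`), the negative-log attention-mass loss, the per-patch masking scheme, and the `weight` parameter are all hypothetical and should not be read as the authors' implementation.

```python
import math


def anchoring_loss(focus_attention, region_mask, eps=1e-8):
    """Assumed anchoring objective: negative log of the share of the
    <Focus> token's attention that falls inside the annotated region.

    focus_attention: non-negative attention weights, one per image patch.
    region_mask: 1 for patches inside the relevant region, else 0.
    """
    total = sum(focus_attention)
    if total <= 0:
        raise ValueError("attention weights must have positive mass")
    inside = sum(a for a, m in zip(focus_attention, region_mask) if m)
    return -math.log(inside / total + eps)


def counterfactual_importance(p_answer_full, p_answer_masked):
    """Assumed causal-importance estimate per patch: the drop in
    correct-answer probability when that patch is masked out,
    clipped at zero and normalised to sum to 1.
    """
    drops = [max(p_answer_full - p, 0.0) for p in p_answer_masked]
    norm = sum(drops)
    if norm == 0:
        return [0.0] * len(drops)
    return [d / norm for d in drops]


def reinforcing_reward(is_correct, attention, importance, weight=0.5):
    """Assumed reward: only answer-correct trajectories are rewarded,
    with a bonus proportional to how much attention lands on
    causally important patches.
    """
    if not is_correct:
        return 0.0
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention weights must have positive mass")
    overlap = sum((a / total) * w for a, w in zip(attention, importance))
    return 1.0 + weight * overlap


if __name__ == "__main__":
    # Toy 4-patch image; patches 1 and 2 hold the relevant evidence.
    attention = [0.1, 0.6, 0.2, 0.1]
    mask = [0, 1, 1, 0]
    importance = counterfactual_importance(0.9, [0.9, 0.3, 0.7, 0.95])
    print(anchoring_loss(attention, mask))
    print(reinforcing_reward(True, attention, importance))
```

Under these assumptions, a correct trajectory whose attention is spread uniformly over the patches earns a smaller bonus than one concentrated on the patches whose removal hurts the answer, which is the behaviour the Reinforcing stage is described as encouraging.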