AI Laboratory | Ant Group | HKUST | May 4, 2026 | arXiv:2605.02730

Perceptual Flow Network for Visually Grounded Reasoning

Yangfu Li, Yuning Gong, Hongjian Zhan, Teng Li, Yuanhuiyi Lyu, Tianyi Chen, Qi Liu, Ziyuan Huang, Zhihang Zhong, Dandan Zheng, Yue Lu

AI Summary

The paper introduces Perceptual Flow Network (PFlowNet) to address language bias and hallucination in Large Vision-Language Models (LVLMs) by decoupling perception from reasoning. PFlowNet employs a self-conditioned generation process and integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning to encourage reasoning-oriented perceptual behaviors. Empirical results demonstrate that PFlowNet achieves state-of-the-art performance on V* Bench (90.6%) and MME-RealWorld-lite (67.0%), surpassing methods relying on rigid geometric priors.

Key Contribution

LVLMs can achieve SOTA visual reasoning by learning to "see" in a way that optimizes for reasoning, even if it means deviating from strict geometric accuracy.

Abstract

Despite the success of Large Vision-Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Based on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, including new state-of-the-art results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
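
One way to read the decoupling of perception from reasoning is as a latent-variable factorization. The sketch below is an illustrative assumption, not the objective stated in the paper: the symbols x (image), q (question), y (answer), z (perceptual trajectory), and the auxiliary distribution \pi_\phi are all introduced here for illustration, and the bound shown is the generic evidence lower bound rather than PFlowNet's actual training loss.

```latex
\begin{align}
p_\theta(y, z \mid x, q) &= p_\theta(z \mid x, q)\, p_\theta(y \mid x, q, z), \\
\log p_\theta(y \mid x, q) &= \log \sum_{z} p_\theta(z \mid x, q)\, p_\theta(y \mid x, q, z) \\
&\ge \mathbb{E}_{z \sim \pi_\phi(z \mid x, q, y)}\big[\log p_\theta(y \mid x, q, z)\big]
   - \mathrm{KL}\big(\pi_\phi(z \mid x, q, y)\,\|\,p_\theta(z \mid x, q)\big).
\end{align}
```

Under this reading, plain MLE on y alone places no direct constraint on z, which is consistent with the abstract's remark that standard objectives fail to constrain visual trajectories. The expectation term favors perceptual trajectories that help produce the answer, while the KL term keeps them close to the model's own perceptual prior. The multi-dimensional rewards and vicinal geometric shaping mentioned in the abstract are not represented in this generic bound.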

Computer Vision | Multimodal Models | Reasoning & Chain-of-Thought

