The paper introduces Perceptual Flow Network (PFlowNet) to address language bias and hallucination in Large Vision-Language Models (LVLMs) by decoupling perception from reasoning. PFlowNet employs a self-conditioned generation process and integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning to encourage reasoning-oriented perceptual behaviors. Empirical results demonstrate that PFlowNet achieves state-of-the-art performance on V* Bench (90.6%) and MME-RealWorld-lite (67.0%), surpassing methods relying on rigid geometric priors.
LVLMs can achieve SOTA visual reasoning by learning to "see" in a way that optimizes for reasoning, even if it means deviating from strict geometric accuracy.
Despite the success of Large Vision-Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. On top of this process, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, notably setting new state-of-the-art results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
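To make the idea of combining multi-dimensional rewards with a soft geometric term more concrete, here is a minimal sketch in Python. It is not the paper's reward: the abstract does not give a formula, so every name, weight, and functional form below (`iou`, `vicinal_shaping`, `combined_reward`, the Gaussian kernel, `sigma`, the weight tuple) is a hypothetical illustration. The only point it shows is that a predicted region near an expert box can earn a smooth bonus rather than being forced to match the box exactly, while answer correctness dominates the total.

```python
import math

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def vicinal_shaping(pred_box, expert_box, sigma=0.5):
    """Hypothetical soft geometric term: a Gaussian kernel on (1 - IoU).

    Equals 1.0 for a perfect match and decays smoothly as the predicted
    region moves away from the expert box; it never enforces exact alignment.
    """
    distance = 1.0 - iou(pred_box, expert_box)
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))

def combined_reward(answer_correct, format_ok, pred_box, expert_box,
                    weights=(1.0, 0.2, 0.3), sigma=0.5):
    """Hypothetical weighted sum of answer, format, and geometric terms.

    The weights are illustrative assumptions, chosen so that answer
    correctness outweighs geometric agreement with the expert.
    """
    w_answer, w_format, w_geo = weights
    return (w_answer * float(answer_correct)
            + w_format * float(format_ok)
            + w_geo * vicinal_shaping(pred_box, expert_box, sigma))

if __name__ == "__main__":
    expert = (0.0, 0.0, 2.0, 2.0)
    far_away = (5.0, 5.0, 6.0, 6.0)
    # Correct answer with a region far from the expert box.
    print(combined_reward(True, True, far_away, expert))
    # Wrong answer with a region that matches the expert box exactly.
    print(combined_reward(False, True, expert, expert))
```

Under these assumed weights, a correct answer grounded in a region far from the expert box still scores higher than a wrong answer with a perfectly aligned box, which mirrors the abstract's claim that reasoning utility should take priority over geometric precision.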