The paper investigates why Reinforcement Learning (RL) methods such as GRPO improve reasoning in Vision-Language Models (VLMs) and finds that GRPO causes a collapse in the diversity of reasoning strategies. To counter this, the authors introduce Multi-Group Policy Optimization (MUPO), which encourages divergent thinking by optimizing across multiple groups of policies. MUPO improves performance on standard VLM benchmarks by preventing premature convergence to suboptimal reasoning paths.
RL's success in boosting VLM reasoning hides a critical flaw: it crushes the model's ability to explore diverse solutions, leading to premature convergence and hindering scalability.
Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite this promise, the underlying mechanisms that drive the effectiveness of RL-trained models, as well as their limitations, remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL-trained and base models: the former engage in deeper yet narrower reasoning, while base models, though less refined along individual paths, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: https://xytian1008.github.io/MUPO/
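The abstract does not specify MUPO's update rule, but the GRPO baseline it critiques is well documented: for each prompt, a group of responses is sampled and each response's advantage is its reward standardized against the group's mean and standard deviation, with no learned value critic. A minimal sketch of that advantage computation (function name and reward values are illustrative, not from the paper):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each response's reward
    against the mean and std of its sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled responses to one prompt with binary correctness rewards.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

One property worth noting in light of the paper's diversity-collapse finding: once a group's responses all receive the same reward (e.g. the policy has converged so every sample succeeds), the standardized advantages are all near zero, so that prompt contributes essentially no gradient signal that could reward alternative strategies.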