Stanford HAI | NTU | Shanghai AI Lab | Shanghai Jiaotong University | Mar 13, 2026 | arXiv:2603.12648

From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space

Jiazi Bu, Pengyang Ling, Yujie Zhou, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Xingang Pan

AI Summary

Multi-View GRPO (MV-GRPO) addresses the limitations of standard Group Relative Policy Optimization (GRPO) in text-to-image flow models by augmenting the condition space to create a dense multi-view reward mapping. MV-GRPO uses a Condition Enhancer to generate semantically adjacent captions from a single prompt, enabling multi-view advantage re-estimation and capturing diverse semantic attributes. It derives the probability distribution of the original samples conditioned on the new captions, so the captions can be used in training without regenerating samples. Experiments show that MV-GRPO achieves superior alignment performance compared to state-of-the-art methods.

Key Contribution

Text-to-image flow models can achieve superior preference alignment by augmenting the condition space, creating a "dense" reward mapping that better captures inter-sample relationships.

Abstract

Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm, which evaluates a group of generated samples against a single condition, suffers from insufficient exploration of inter-sample relationships, constraining both alignment efficacy and performance ceilings. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate the captions into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
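To make the "sparse versus dense" distinction concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual GRPO convention of standardizing rewards within a group, and it assumes that multi-view re-estimation applies the same standardization once per caption. The reward values, the function names `group_relative_advantages` and `multi_view_advantages`, and the choice of keeping one advantage row per caption are all hypothetical; this page does not specify how the paper aggregates views or which reward model it uses.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group (along the last axis).

    This is the common GRPO-style normalization: subtract the group mean
    and divide by the group standard deviation.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)


def multi_view_advantages(reward_matrix):
    """Assumed multi-view variant: one advantage vector per caption.

    reward_matrix[k, i] is the reward of sample i scored against caption k.
    Row 0 is taken to be the original prompt; later rows are the
    semantically adjacent captions (hypothetical layout).
    """
    return group_relative_advantages(reward_matrix)


# Hypothetical rewards: 3 captions (views) x 4 samples from one prompt.
reward_matrix = np.array([
    [0.2, 0.5, 0.9, 0.4],  # original prompt (the single "sparse" view)
    [0.7, 0.1, 0.3, 0.5],  # augmented caption 1
    [0.4, 0.4, 0.8, 0.0],  # augmented caption 2
])

single_view = group_relative_advantages(reward_matrix[0])  # shape (4,)
multi_view = multi_view_advantages(reward_matrix)          # shape (3, 4)
```

In this sketch the single-view case yields one advantage per sample, while the multi-view case yields one advantage per (caption, sample) pair for the same set of samples, which is the sense in which the reward mapping becomes denser without generating new images.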

Computer Vision | Multimodal Models | RLHF & Preference Learning

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
