Mar 29, 2026, arXiv:2603.27693

LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation

AI Summary

The paper introduces LVRPO, a language-visual reinforcement-based preference optimization framework that aligns language and visual representations using Group Relative Policy Optimization (GRPO). LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, avoiding auxiliary encoders or handcrafted cross-modal objectives. Experiments show LVRPO outperforms strong unified-pretraining baselines across multimodal understanding, generation, and reasoning tasks.

Key Contribution

Forget auxiliary encoders and handcrafted losses: LVRPO uses reinforcement learning to directly align language and vision, boosting performance across a range of multimodal tasks.

Abstract

Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language and vision across both understanding and generation tasks. This formulation enables effective alignment without requiring auxiliary encoders or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities. Empirically, LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.
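The abstract names GRPO but does not spell out its mechanics. The following is a minimal sketch of the group-relative advantage step, assuming the standard GRPO formulation in which several responses are sampled for the same prompt and each reward is standardized against its own group. The function name `group_relative_advantages`, the `eps` constant, and the example reward values are hypothetical; how LVRPO defines its language-visual preference rewards is not stated here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled response to the same
    prompt. `eps` (an assumed constant) avoids division by zero when all
    rewards in the group are equal.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
example_rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(example_rewards)
```

Responses scored above the group mean receive a positive advantage and those below receive a negative one, so the policy update needs no separate learned value function as a baseline.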

Architecture Design (Transformers, SSMs, MoE), Multimodal Models, RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
