MIT CSAIL | Improbable AI Lab | May 21, 2026 | arXiv:2605.22817

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani, Sebastian Risi, Omar Khattab, Zhang-Wei Hong, Pulkit Agrawal

AI Summary

Vector Policy Optimization (VPO) is an RL algorithm that trains language models to anticipate diverse downstream reward functions by optimizing over a vector-valued reward space. Standard scalar-reward optimization, by contrast, often produces low-entropy response distributions. VPO replaces the GRPO advantage estimator and encourages the LLM to output a set of solutions, each specialized to a different trade-off in the vector reward space. Across four tasks, VPO matches or surpasses the strongest scalar RL baselines on test-time search, with the gap widening as the search budget increases. In evolutionary search, VPO models solve problems that GRPO models cannot solve at all.
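For context, the scalar baseline that VPO is said to replace can be sketched as follows. This is a minimal illustration of the standard group-relative (GRPO-style) advantage applied to a vector reward that has been averaged into a scalar. It is not the VPO estimator, which this page does not specify. The function names, the use of the population standard deviation, the `eps` value, and the toy rollout group are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def scalarize(reward_vectors):
    """Collapse each per-test-case reward vector to one scalar by averaging.

    This is the scalar-baseline step; the vector structure is discarded here.
    """
    return [mean(v) for v in reward_vectors]


def grpo_advantages(rewards, eps=1e-6):
    """Standardize scalar rewards within one prompt's rollout group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: 4 rollouts, each scored pass/fail on 3 test cases.
group = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
scalar_rewards = scalarize(group)
advantages = grpo_advantages(scalar_rewards)
```

In this toy group the first three rollouts each pass a different test case, yet they receive identical scalar rewards and therefore identical advantages. That loss of information about which trade-off a rollout covers is the kind of signal a vector-valued objective could, in principle, retain.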

Key Contribution

LLMs trained with Vector Policy Optimization (VPO) learn to produce diverse solutions that unlock previously unsolvable problems in evolutionary search, outperforming models optimized for single scalar rewards.

Abstract

Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, which often leads current LLMs to produce low-entropy response distributions and thus to lack the diversity that inference-time search requires. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits the fact that rewards are often vector-valued in practice, for example per-test-case correctness in code generation, or scores from multiple user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions in which individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g., pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
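The test-time search metrics named in the abstract can be made concrete with a short sketch. The `pass_at_k` function below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. Whether the paper computes pass@k this way is an assumption, as is the reading of best@k as the highest score among the first k samples. Both function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given n total samples of which c are correct (requires k <= n).
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def best_at_k(scores, k):
    """Highest score among the first k samples (assumed meaning of best@k)."""
    return max(scores[:k])


# Hypothetical numbers: 4 samples, 1 correct.
p1 = pass_at_k(4, 1, 1)
p2 = pass_at_k(4, 1, 2)
b2 = best_at_k([0.2, 0.9, 0.5], 2)
```

Both metrics are non-decreasing in k, so they reward a model whose samples cover different solutions rather than repeating one, which is why they serve as a proxy for usefulness under a growing search budget.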

Natural Language Processing | RLHF & Preference Learning | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
