BandPO addresses a limitation of PPO's fixed clipping bounds in LLM reinforcement learning: they disproportionately suppress low-probability, high-advantage actions. It introduces a "Band" operator that adjusts clipping intervals dynamically based on action probabilities by projecting trust regions defined by f-divergences into probability-aware bounds. The mapping is formulated as a convex optimization problem, and experiments across various models and datasets show that BandPO outperforms canonical PPO clipping and Clip-Higher while mitigating entropy collapse.
PPO's fixed clipping hurts exploration by squashing high-reward, low-probability actions, but BandPO fixes this with probability-aware bounds that boost performance.
Proximal constraints are fundamental to the stability of Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
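The bottleneck described in the abstract can be illustrated numerically. The sketch below is not the paper's Band operator. It is a minimal illustration under stated assumptions: the policy is reduced to a two-outcome (chosen action vs. everything else) distribution, the trust region is a KL(old || new) ball of radius `delta`, and the function names, `eps = 0.2`, and `delta = 0.05` are all hypothetical choices made for this example. It contrasts the upward margin allowed by a fixed ratio clip with a probability-aware upper ratio obtained by bisection.

```python
import math

def fixed_clip_upper(p_old, eps=0.2):
    """Largest new probability a fixed ratio clip allows (ratio <= 1 + eps)."""
    return min(1.0, (1.0 + eps) * p_old)

def binary_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for 0 < p < 1 and 0 < q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_upper_ratio(p_old, delta, iters=100):
    """Largest ratio q / p_old such that binary_kl(p_old, q) <= delta.

    Assumption for illustration only: a two-outcome reduction of the policy
    and a KL(old || new) trust region. For q > p_old the KL grows
    monotonically in q, so bisection on q is valid.
    """
    lo, hi = p_old, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid < 1.0 and binary_kl(p_old, mid) <= delta:
            lo = mid  # mid is still inside the trust region
        else:
            hi = mid  # mid violates the trust region
    return lo / p_old

if __name__ == "__main__":
    for p in (0.01, 0.5):
        margin = fixed_clip_upper(p) - p
        ratio = kl_upper_ratio(p, delta=0.05)
        print(f"p_old={p}: fixed-clip margin={margin:.4f}, KL upper ratio={ratio:.3f}")
```

Under these assumptions, the fixed clip gives a rare action a much smaller absolute upward margin than a common one, while the KL-derived upper ratio is larger for the rare action. This is the qualitative behavior the abstract attributes to probability-aware clipping intervals.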