Fudan | Mar 5, 2026 | arXiv:2603.04918

BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

Yuan Li, Bo Wang, Boyu Wang, Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu

AI Summary

BandPO targets a limitation of PPO's fixed clipping bounds in LLM reinforcement learning: they disproportionately suppress low-probability, high-advantage actions. It introduces a "Band" operator that projects f-divergence trust regions into dynamic clipping intervals that depend on each action's probability. The projection is formulated as a convex optimization problem, and experiments across several models and datasets show BandPO outperforming canonical clipping and Clip-Higher while mitigating entropy collapse.

Key Contribution

PPO's fixed clipping hurts exploration by suppressing high-advantage, low-probability actions. BandPO replaces it with probability-aware bounds derived from f-divergence trust regions, and reports consistent gains over canonical clipping and Clip-Higher.

Abstract

Proximal constraints are fundamental to the stability of Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
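
The bottleneck described in the abstract can be illustrated numerically. The sketch below is not the paper's Band operator. It assumes a simplified binary view (the chosen token versus everything else) and a hypothetical KL budget `delta`, and it uses bisection to find the ratio interval that keeps the KL divergence within that budget. The function names, the direction of the KL term, and the values of `eps` and `delta` are all illustrative assumptions.

```python
import math


def fixed_clip_headroom(p_old, eps=0.2):
    """Largest absolute probability gain allowed when the ratio is capped at 1 + eps."""
    return min(p_old * (1.0 + eps), 1.0) - p_old


def bernoulli_kl(p, q):
    """KL(Bern(p) || Bern(q)), using the convention 0 * log(0) = 0."""
    def term(a, b):
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf
        return a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)


def kl_band(p_old, delta, iters=100):
    """Ratio interval (lo, hi) with KL(Bern(p_old) || Bern(r * p_old)) <= delta.

    Assumption: the trust region is reduced to a two-outcome distribution.
    """
    def solve(feasible, limit):
        # Bisection between a feasible point and the boundary of [0, 1].
        for _ in range(iters):
            mid = 0.5 * (feasible + limit)
            if bernoulli_kl(p_old, mid) <= delta:
                feasible = mid
            else:
                limit = mid
        return feasible

    q_hi = solve(p_old, 1.0)  # largest admissible new probability
    q_lo = solve(p_old, 0.0)  # smallest admissible new probability
    return q_lo / p_old, q_hi / p_old


if __name__ == "__main__":
    for p in (0.01, 0.5):
        lo, hi = kl_band(p, delta=0.05)
        print(f"p_old={p}: fixed headroom={fixed_clip_headroom(p):.4f}, "
              f"KL band=[{lo:.3f}, {hi:.3f}]")
```

Under these assumptions, the fixed clip gives a rare token an absolute headroom proportional to its probability, while the divergence-based interval widens as the probability shrinks, which is the qualitative behavior the abstract attributes to probability-aware bounds.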

RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
