Huawei | May 28, 2026 | arXiv:2605.30201

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang

AI Summary

The paper identifies a failure mode in GRPO-style reinforcement learning with sparse rewards: early updates are dominated by negative advantages, and response-level length normalization makes this worse by tying the update magnitude to output length. To address this, the authors introduce Hysteretic Policy Optimization (HPO), which down-weights negative-advantage updates and uses mean-length normalization. Adaptive HPO (A-HPO) sets the hysteretic weight automatically from batch-level advantage-sign statistics, improving reward per update, particularly in early sparse-reward regimes.

Key Contribution

A-HPO improves reward per update in sparse-reward RL by adaptively balancing positive and negative advantage signals. On TeleLogs it reaches a final reward of 0.84, outperforming GRPO, GSPO, and SAPO, and its gains are largest in the early stages of training.

Abstract

We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length normalization with mean-length normalization. We further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight based on batch-level advantage-sign statistics, thereby removing the need for tuning a fixed hysteretic weight. In our TeleLogs and Countdown experiments, A-HPO improves the reward per update compared to GRPO, with the largest gains in early sparse-reward regimes. On TeleLogs, A-HPO achieves a final reward of 0.84, outperforming SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining a comparable response length. On Countdown, A-HPO achieves the largest gains in initial and most difficult configurations across 1.5B-7B models. Ablation studies on the hysteretic weight show that the gains of A-HPO come from better balancing the contributions of positive and negative advantages compared to positive-only or fully symmetric updates.
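The two modifications described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes GRPO-style group-normalized advantages, a hysteretic weight `beta` in [0, 1] that multiplies only negative advantages, and a per-token scale that divides by the group's mean response length instead of each response's own length. The function names are hypothetical, and the rule in `adaptive_beta` (ratio of positive to negative advantage counts, clipped to 1) is only one plausible way to use "batch-level advantage-sign statistics"; the abstract does not give the actual formula.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def apply_hysteresis(advantages, beta):
    """Scale negative advantages by beta; leave non-negative ones unchanged."""
    return [a if a >= 0 else beta * a for a in advantages]


def adaptive_beta(advantages):
    """ASSUMPTION: a hypothetical adaptive rule, not the formula from the paper.

    Uses the ratio of positive to negative advantage counts, clipped to 1,
    so that many negatives and few positives yield a small weight.
    """
    n_pos = sum(1 for a in advantages if a > 0)
    n_neg = sum(1 for a in advantages if a < 0)
    if n_neg == 0:
        return 1.0
    return min(1.0, n_pos / n_neg)


def per_token_scales(advantages, lengths, use_mean_length=True):
    """Coefficient applied to every token of each response.

    Per-response normalization divides by that response's own length;
    mean-length normalization divides every response by the group mean length.
    """
    mean_len = mean(lengths)
    return [
        a / (mean_len if use_mean_length else n)
        for a, n in zip(advantages, lengths)
    ]


# Sparse-reward group: one success out of four sampled responses.
rewards = [1.0, 0.0, 0.0, 0.0]
lengths = [10, 20, 30, 40]

adv = group_advantages(rewards)
beta = adaptive_beta(adv)              # 1 positive vs 3 negatives -> 1/3
weighted = apply_hysteresis(adv, beta)
scales = per_token_scales(weighted, lengths)
```

With a fixed `beta`, setting it to 0 gives positive-only updates and setting it to 1 recovers symmetric updates, the two extremes the ablation in the abstract compares against.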

RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 14

Year: 2026

Venue: N/A
