Mar 11, 2026 | arXiv:2603.11321

Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

Yuning Wu, Ke Wang, Devin Chen, Kaichen Wei

AI Summary

Hindsight-Anchored Policy Optimization (HAPO) addresses the challenge of advantage collapse and distributional bias in sparse-reward RLVR settings by selectively anchoring policy optimization to teacher demonstrations during failure states. HAPO uses a Synthetic Success Injection (SSI) operator, a hindsight mechanism guided by a Thompson sampling-inspired gating mechanism, to create a self-paced curriculum. Theoretical analysis demonstrates that HAPO achieves asymptotic consistency by annealing the teacher signal, recovering the unbiased on-policy gradient and allowing the model to surpass teacher limitations.

Key Contribution

By selectively injecting teacher demonstrations only during failure, HAPO overcomes the limitations of both pure RL and mixed-policy optimization in sparse-reward RLVR, enabling models to surpass static teacher forcing.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimization (GRPO) face a critical dilemma in sparse-reward settings: pure Reinforcement Learning (RL) suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias. To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO). HAPO employs the Synthetic Success Injection (SSI) operator, a hindsight mechanism that selectively anchors optimization to teacher demonstrations during failure. This injection is governed by a Thompson sampling-inspired gating mechanism, creating an autonomous, self-paced curriculum. Theoretically, we demonstrate that HAPO achieves asymptotic consistency: by naturally annealing the teacher signal as the policy improves, HAPO recovers the unbiased on-policy gradient. This ensures off-policy guidance acts as a temporary scaffold rather than a persistent ceiling, enabling the model to surpass the limitations of static teacher forcing.
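The advantage collapse described in the abstract can be illustrated with a small sketch. Group-relative methods such as GRPO normalize each rollout's reward against its group's mean and standard deviation, so a group in which every rollout fails yields all-zero advantages and no gradient signal. The code below shows that effect and then a rough idea of how a teacher demonstration might be injected into a failed group. The abstract does not specify the SSI operator or the gate, so `inject_synthetic_success`, `should_inject`, the Beta(1 + successes, 1 + failures) posterior, and the choice to overwrite the first rollout are all hypothetical names and assumptions made for illustration, not the authors' implementation.

```python
import random
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the GRPO style: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def should_inject(successes, failures, rng):
    """Hypothetical Thompson-style gate (an assumption, not the paper's rule).

    Samples a plausible success rate from a Beta posterior over the policy's
    past outcomes, then injects with probability equal to the sampled
    failure rate. As successes accumulate, injection becomes rare.
    """
    sampled_success_rate = rng.betavariate(1 + successes, 1 + failures)
    return rng.random() > sampled_success_rate


def inject_synthetic_success(rewards, teacher_reward=1.0):
    """Hypothetical SSI step: patch a group only when every rollout failed.

    Returns the (possibly patched) rewards and whether injection happened.
    Replacing index 0 is an arbitrary choice made for this sketch.
    """
    if any(r > 0 for r in rewards):
        return list(rewards), False
    patched = list(rewards)
    patched[0] = teacher_reward
    return patched, True


if __name__ == "__main__":
    rng = random.Random(0)
    failed_group = [0.0, 0.0, 0.0, 0.0]
    print("all-fail advantages:", group_advantages(failed_group))

    if should_inject(successes=2, failures=30, rng=rng):
        patched, _ = inject_synthetic_success(failed_group)
        print("patched advantages:", group_advantages(patched))
```

Under these assumptions, the gate fires often while the policy mostly fails and almost never once it mostly succeeds, which is one way to picture the annealing of the teacher signal that the abstract describes.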

RLHF & Preference Learning; Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 23

Year: 2026

Venue: N/A
