BAAI, CAS | Apr 22, 2026 | arXiv:2604.20733

Near-Future Policy Optimization

Chuanyu Qin, Chen Yang, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang

AI Summary

This paper introduces Near-Future Policy Optimization (NPO), a mixed-policy reinforcement learning method that leverages checkpoints from a policy's own training history as a source of off-policy trajectories. NPO balances trajectory quality and variance by using "near-future" policies, which are both stronger and closer to the current policy than external sources. Experiments on Qwen3-VL-8B-Instruct with GRPO show that NPO and its adaptive variant, AutoNPO, improve average performance and accelerate convergence by strategically bootstrapping and breaking plateaus.

Key Contribution

Forget external teachers – the best way to boost your RL policy might be learning from its future self.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality). Neither simultaneously satisfies the two conditions required to maximize the effective learning signal $\mathcal{S} = Q/V$: being strong enough (higher $Q$, more new knowledge to learn) and close enough (lower $V$, more readily absorbed). We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes $\mathcal{S}$. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
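The guide-selection rule in the abstract (pick the checkpoint that maximizes $\mathcal{S} = Q/V$) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how $Q$ and $V$ are estimated, so the `GuideCandidate` record, its fields, the numeric values, and the choice to skip candidates with non-positive $Q$ are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class GuideCandidate:
    """Hypothetical record for one later checkpoint of the same training run."""
    step: int        # training step at which the checkpoint was saved
    quality: float   # assumed estimate of Q (how much stronger than the current policy)
    variance: float  # assumed estimate of V (variance cost of learning from it)


def learning_signal(c: GuideCandidate, eps: float = 1e-8) -> float:
    """Effective learning signal S = Q / V; eps guards against division by zero."""
    return c.quality / max(c.variance, eps)


def select_guide(candidates: Iterable[GuideCandidate]) -> Optional[GuideCandidate]:
    """Return the candidate with the largest S, or None if none is usable.

    Skipping candidates with Q <= 0 is an assumption of this sketch:
    a checkpoint that is not stronger offers nothing new to learn.
    """
    useful = [c for c in candidates if c.quality > 0]
    return max(useful, key=learning_signal, default=None)


# Made-up numbers: a nearer checkpoint is weaker but cheap to absorb,
# a farther one is stronger but carries a higher variance cost.
candidates = [
    GuideCandidate(step=100, quality=0.02, variance=0.10),
    GuideCandidate(step=200, quality=0.06, variance=0.20),
    GuideCandidate(step=400, quality=0.09, variance=0.60),
]
best = select_guide(candidates)
print(best.step if best is not None else "no usable guide")
```

With these illustrative values the middle checkpoint wins: the farthest checkpoint has the highest $Q$, but its larger $V$ lowers the ratio, which is the trade-off the abstract describes.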

RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 32

Year: 2026

Venue: N/A
