DAMO · PKU · May 28, 2026 · arXiv:2605.29860

ESPO: Early-Stopping Proximal Policy Optimization

Zihang Li, Rui Zhou, Wenhan Yu, Zhewen Tan, Zixiang Liu, Zeming Li, Binhua Li

AI Summary

This paper introduces ESPO (Early-Stopping Proximal Policy Optimization), which detects failed trajectories during reinforcement learning for large language models and terminates them early, avoiding computation on tokens that never receive positive reward. ESPO computes a surrogate regret from logits already produced during sampling, and truncating failed rollouts keeps post-failure noise out of the advantage estimates. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO outperforms standard PPO on AIME 2024, AMC 2023, and MATH-500 while using more than 20% fewer rollout tokens.

Key Contribution

Early stopping of failed rollouts saves more than 20% of rollout tokens while improving reasoning accuracy in large language models.

Abstract

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on the fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% of rollout tokens cumulatively.
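The stopping rule described in the abstract can be illustrated with a minimal sketch. The abstract does not define the surrogate regret, the smoothing, or the threshold, so everything concrete here is an assumption made for illustration: the regret proxy (log-probability gap between the greedy token and the sampled token), the exponential smoothing factor `beta`, the per-step `expected` baseline, the `margin`, and all function names are hypothetical and may differ from the paper's actual definitions.

```python
import math
from typing import List, Optional, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one step's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def surrogate_regret(logits: Sequence[float], sampled: int) -> float:
    """ASSUMED proxy, not the paper's definition: the log-probability gap
    between the most likely token and the token actually sampled (>= 0)."""
    logp = log_softmax(logits)
    return max(logp) - logp[sampled]


def early_stop_step(
    step_logits: Sequence[Sequence[float]],
    sampled_tokens: Sequence[int],
    expected: Sequence[float],
    margin: float,
    beta: float = 0.9,
) -> Optional[int]:
    """Return the first step index at which the smoothed cumulative regret
    exceeds the (hypothetical) expected value by more than `margin`,
    or None if the rollout is never flagged."""
    cumulative = 0.0
    smoothed = 0.0
    for t, (logits, tok) in enumerate(zip(step_logits, sampled_tokens)):
        # Uses only logits that sampling has already produced.
        cumulative += surrogate_regret(logits, tok)
        # Exponential moving average of the running total (assumed smoothing).
        smoothed = beta * smoothed + (1.0 - beta) * cumulative
        if smoothed > expected[t] + margin:
            return t
    return None


# Toy rollout: two-token vocabulary, three generation steps.
logits = [[2.0, 0.0]] * 3
greedy_run = early_stop_step(logits, [0, 0, 0], [0.0] * 3, margin=1.5, beta=0.5)
unlikely_run = early_stop_step(logits, [1, 1, 1], [0.0] * 3, margin=1.5, beta=0.5)
```

In this toy setup, a rollout that always samples the most likely token accumulates zero regret and is never stopped, while a rollout that keeps sampling the unlikely token is flagged once its smoothed regret passes the margin. Per the abstract, a rollout truncated this way would then be treated as an absorbing failure state with a terminal reward.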

Reasoning & Chain-of-Thought · RLHF & Preference Learning · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
