Sequence-Level PPO (SPPO) is introduced to address the challenges of applying PPO to long-horizon reasoning tasks with LLMs, specifically instability in credit assignment and high memory costs. SPPO reframes the reasoning process as a sequence-level contextual bandit problem, using a decoupled scalar value function for low-variance advantage estimation. Experiments on mathematical benchmarks show SPPO achieves comparable performance to group-based methods with significantly improved sample efficiency.
PPO can be made sample-efficient and stable for long-horizon reasoning in LLMs by treating the problem as a sequence-level contextual bandit, sidestepping the need for computationally expensive multi-sampling.
Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.