The paper addresses instability in off-policy RL training of LLMs caused by policy staleness and distribution shift. The authors introduce Variational sEquence-level Soft Policy Optimization (VESPO), which uses a variational formulation to derive a closed-form reshaping kernel for sequence-level importance weights, mitigating variance without length normalization. Experiments on mathematical reasoning demonstrate that VESPO enables stable training with high staleness ratios and asynchronous execution, improving performance on both dense and MoE models.
VESPO stabilizes off-policy RL training for LLMs by directly reshaping sequence-level importance weights, tolerating 64x policy staleness and asynchronous execution without collapse.
Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at https://github.com/FloyedShen/VESPO.
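To make the variance problem concrete, the sketch below shows why sequence-level importance weights are fragile: the weight is a product of per-token probability ratios, so it grows or shrinks exponentially with sequence length, and a stale behavior policy can push it to extreme values. The `temper` function here is a hypothetical stand-in for a reshaping kernel (simple power tempering toward 1); it is NOT the closed-form kernel VESPO derives, which comes from the paper's variational formulation.

```python
import math

def sequence_importance_weight(logp_current, logp_behavior):
    """Sequence-level importance weight pi_cur(y|x) / pi_beh(y|x),
    computed as exp(sum of per-token log-prob differences).
    Note: no length normalization is applied."""
    return math.exp(sum(c - b for c, b in zip(logp_current, logp_behavior)))

def temper(w, alpha=0.5):
    """Illustrative reshaping only: w**alpha (0 < alpha < 1) pulls
    extreme weights toward 1, trading bias for lower variance.
    VESPO's actual kernel is derived in the paper and differs."""
    return w ** alpha

# A modest per-token log-prob gap of 0.25 over 20 tokens already
# yields a raw weight of exp(5) ~ 148; tempering shrinks it sharply.
cur = [-1.00] * 20
beh = [-1.25] * 20
w_raw = sequence_importance_weight(cur, beh)
w_tempered = temper(w_raw)
```

This illustrates the tension the abstract describes: the raw sequence-level weight is an unbiased correction but multiplicatively unstable, which is why a principled reshaping of the weight itself, rather than per-token clipping or length normalization, is the paper's focus.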