Independent Researcher, Tencent AI | May 26, 2026 | arXiv:2605.27140

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

AI Summary

StepOPSD addresses the credit assignment problem in multi-turn agent RL with a step-aware online preference distillation framework. It decomposes trajectories into action-centered segments, rescores them under hindsight-enriched teacher contexts, and converts the resulting log-probability gaps into advantage shaping before the GRPO update. On ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on the subsets most sensitive to local causal errors.

Key Contribution

StepOPSD shows that treating the individual agent step, rather than the entire trajectory, as the unit of credit redistribution improves multi-turn agent reinforcement learning, most clearly on tasks where a few local decisions determine success.

Abstract

Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units. We present StepOPSD, a post-rollout preference self-distillation framework that takes the agent step as the unit of credit redistribution. StepOPSD decomposes trajectories into action-centered step segments, rescoring them under hindsight-enriched teacher contexts and converting token-level log-probability gaps into sign-preserving advantage shaping with a normalized per-step credit budget before the GRPO update. Across ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on subsets most sensitive to local causal errors, including first-place performance on ALFWorld Heat (79.1%), PickTwo (95.0%), Search-QA TriviaQA (61.6%), and tied-best performance on HotpotQA (40.4%). The results further reveal a consistent two-knob law: smaller α_clip acts as a broadly stabilizing local trust region, whereas the optimal global mixing strength λ_mix remains task-dependent. These findings suggest that step-aware distillation is most useful when trajectory-level rewards are weakly aligned with the local action that determines downstream success.
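The abstract does not give the exact shaping rule, so the following is only a minimal sketch of the general idea, under assumptions that are not from the paper. It first computes standard group-relative (GRPO-style) advantages. It then spreads one trajectory-level advantage over tokens using clipped teacher-minus-student log-probability gaps that are centered within each step. The function names, the centering used as a "per-step budget", and the formula `adv + lam_mix * |adv| * c_t` are all hypothetical illustrations of sign-preserving shaping, not the authors' method.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_shaped_advantages(adv, student_logp, teacher_logp, step_ids,
                           alpha_clip=0.2, lam_mix=0.5):
    """Redistribute one trajectory-level advantage over tokens, step by step.

    Hypothetical rule (an assumption, not taken from the paper):
      1. gap_t = clip(teacher_logp_t - student_logp_t, -alpha_clip, alpha_clip)
      2. c_t   = gap_t minus the mean gap of the step containing token t,
                 so the total credit of every step stays unchanged
      3. A_t   = adv + lam_mix * |adv| * c_t
    Since |c_t| <= 2 * alpha_clip, requiring 2 * alpha_clip * lam_mix < 1
    guarantees that A_t keeps the sign of adv.
    """
    if alpha_clip < 0 or lam_mix < 0 or 2 * alpha_clip * lam_mix >= 1:
        raise ValueError("need 0 <= 2 * alpha_clip * lam_mix < 1")

    # 1) Clipped token-level log-probability gaps.
    gaps = [max(-alpha_clip, min(alpha_clip, t - s))
            for s, t in zip(student_logp, teacher_logp)]

    # 2) Center the gaps inside each step segment.
    by_step = {}
    for sid, g in zip(step_ids, gaps):
        by_step.setdefault(sid, []).append(g)
    step_mean = {sid: mean(gs) for sid, gs in by_step.items()}
    centered = [g - step_mean[sid] for sid, g in zip(step_ids, gaps)]

    # 3) Sign-preserving shaping of the trajectory-level advantage.
    return [adv + lam_mix * abs(adv) * c for c in centered]


if __name__ == "__main__":
    # Toy group of four rollouts with binary success rewards.
    advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
    # Toy trajectory: five action tokens split into two steps.
    shaped = step_shaped_advantages(
        advs[0],
        student_logp=[-1.0, -2.0, -0.5, -0.5, -3.0],
        teacher_logp=[-0.9, -2.5, -0.5, -0.4, -1.0],
        step_ids=[0, 0, 0, 1, 1],
    )
    print([round(a, 3) for a in shaped])
```

In this sketch `alpha_clip` bounds how far any single token can move away from the trajectory-level advantage, while `lam_mix` scales the overall strength of the redistribution. That mirrors the roles the abstract assigns to α_clip and λ_mix only loosely.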

Inference & Quantization | RLHF & Preference Learning | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
