StepOPSD addresses the credit assignment problem in multi-turn agent RL with a step-aware, post-rollout preference self-distillation framework. It decomposes trajectories into action-centered segments, rescores them under hindsight-enriched teacher contexts, and converts token-level log-probability gaps into advantage shaping. In experiments on ALFWorld and Search-QA with Qwen models, StepOPSD attains best or second-best results on the subsets most sensitive to local causal errors.
StepOPSD shows that focusing on individual agent steps, rather than entire trajectories, unlocks significant performance gains in multi-turn agent reinforcement learning.
Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units. We present StepOPSD, a post-rollout preference self-distillation framework that takes the agent step as the unit of credit redistribution. StepOPSD decomposes trajectories into action-centered step segments, rescoring them under hindsight-enriched teacher contexts and converting token-level log-probability gaps into sign-preserving advantage shaping with a normalized per-step credit budget before the GRPO update. Across ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on subsets most sensitive to local causal errors, including first-place performance on ALFWorld Heat (79.1%), PickTwo (95.0%), Search-QA TriviaQA (61.6%), and tied-best performance on HotpotQA (40.4%). The results further reveal a consistent two-knob law: smaller α_clip acts as a broadly stabilizing local trust region, whereas the optimal global mixing strength λ_mix remains task-dependent. These findings suggest that step-aware distillation is most useful when trajectory-level rewards are weakly aligned with the local action that determines downstream success.
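The abstract does not give the shaping formula, so the following is only an illustrative sketch of what sign-preserving advantage shaping with a normalized per-step credit budget could look like. Everything in it is an assumption made for illustration: the function name `shape_advantages`, the centering and L1 normalization of gaps within each step, the multiplicative form `1 + lambda_mix * credit`, and the default values. The parameters `alpha_clip` and `lambda_mix` stand in for the clip and mixing knobs named in the abstract, but their exact roles in the paper may differ.

```python
from typing import Dict, List, Sequence


def shape_advantages(
    advantage: float,
    gaps: Sequence[float],
    step_ids: Sequence[int],
    alpha_clip: float = 0.2,
    lambda_mix: float = 0.5,
) -> List[float]:
    """Hypothetical per-token advantage shaping (not the paper's formula).

    advantage: trajectory-level advantage, e.g. from a GRPO baseline.
    gaps:      per-token teacher-minus-student log-probability gaps.
    step_ids:  agent-step index of each token.
    """
    if len(gaps) != len(step_ids):
        raise ValueError("gaps and step_ids must have the same length")
    if alpha_clip < 0 or lambda_mix < 0 or lambda_mix * alpha_clip >= 1:
        # Assumed constraint: keeps every multiplier strictly positive,
        # so the shaped advantage never flips sign.
        raise ValueError("need 0 <= lambda_mix * alpha_clip < 1")

    # Group token positions by the agent step they belong to.
    steps: Dict[int, List[int]] = {}
    for pos, step in enumerate(step_ids):
        steps.setdefault(step, []).append(pos)

    shaped = [0.0] * len(gaps)
    for positions in steps.values():
        mean_gap = sum(gaps[p] for p in positions) / len(positions)
        centered = [gaps[p] - mean_gap for p in positions]
        # Assumed "credit budget": absolute credits in a step sum to at most 1.
        budget = sum(abs(c) for c in centered)
        for p, c in zip(positions, centered):
            credit = c / budget if budget > 0 else 0.0
            # Local trust region: bound each token's credit by alpha_clip.
            credit = max(-alpha_clip, min(alpha_clip, credit))
            shaped[p] = advantage * (1.0 + lambda_mix * credit)
    return shaped


if __name__ == "__main__":
    # Two steps of two tokens each; only the first step has a nonzero gap.
    print(shape_advantages(1.0, [0.5, -0.5, 0.0, 0.0], [0, 0, 1, 1]))
```

In this sketch a smaller `alpha_clip` narrows how far any token can move away from the trajectory-level advantage, while `lambda_mix` scales the overall strength of the redistribution; tokens in a step with no gap variation simply keep the original advantage.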