Amazon Science | Jun 10, 2026 | arXiv:2606.12634

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein

AI Summary

This paper introduces Sibling-Guided Credit Distillation (SGCD), which improves credit assignment in long-horizon tool-use reinforcement learning. Direct token-level self-distillation can amplify harmful shortcuts along with useful skills. SGCD instead uses dynamic sampling to obtain mixed successful and failed sibling rollouts, has an external LLM summarize their contrast into a training-only stepwise credit reference, and uses bounded credit weights to reshape GRPO token advantages. The deployed agent sees no external LLM, sibling evidence, or oracle. SGCD improves over matched GRPO comparators on AppWorld (TGC 42.9 to 45.6 on test_normal and 24.7 to 27.0 on test_challenge) and on $τ^3$-airline (pass@1 0.583 to 0.602).

Key Contribution

Direct token-level self-distillation can silently destroy tool use by amplifying useful skills and harmful shortcuts together. Sibling-Guided Credit Distillation instead uses distillation for credit assignment, reshaping GRPO token advantages rather than adding a competing actor loss.

Abstract

Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $τ^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $τ^3$-airline pass@1 $0.583 \to 0.602$.
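The last step of the pipeline, bounded credit weights reshaping GRPO token advantages, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the mean-normalized weight mapping, the clipping bounds `low=0.5` and `high=1.5`, and the binary rewards. The per-token divergences are taken as given inputs; how SGCD computes them from the teacher and student is not reproduced here.

```python
from statistics import mean, pstdev


def is_mixed_group(rewards):
    """True when a sibling group has at least one success and one failure."""
    return any(r > 0 for r in rewards) and any(r <= 0 for r in rewards)


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized trajectory advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def bounded_credit_weights(divergences, low=0.5, high=1.5):
    """Map per-token divergences to weights clipped to [low, high].

    Hypothetical mapping: each divergence is divided by the trajectory mean,
    then clipped. The result is a list of plain floats, which mirrors a
    "detached" weight: no gradient would flow through it.
    """
    if not divergences:
        return []
    m = mean(divergences)
    if m <= 0:
        # No usable signal: fall back to uniform credit.
        return [1.0] * len(divergences)
    return [min(high, max(low, d / m)) for d in divergences]


def reshape_token_advantages(advantage, weights):
    """Broadcast one trajectory advantage over tokens, scaled per token."""
    return [advantage * w for w in weights]


# Usage: four sibling rollouts with binary verifier rewards (assumed values).
rewards = [1.0, 0.0, 1.0, 0.0]
if is_mixed_group(rewards):
    advantages = grpo_advantages(rewards)
    weights = bounded_credit_weights([0.1, 0.1, 0.4, 0.2])
    token_advantages = reshape_token_advantages(advantages[0], weights)
```

Because the weights are strictly positive and bounded, each token advantage keeps the sign of the trajectory-level advantage. In this sketch the weights only redistribute magnitude across tokens, which is one way to read the abstract's point that distillation informs credit assignment rather than acting as a competing actor loss.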

RLHF & Preference Learning, Tool Use & Agents

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
