This paper introduces Sibling-Guided Credit Distillation (SGCD), an approach to long-horizon tool-use reinforcement learning that improves credit assignment using dynamically sampled sibling rollouts. Unlike direct token-level self-distillation, which can amplify harmful shortcuts alongside useful skills, SGCD has an external LLM summarize the contrast between successful and failed rollouts into a training-only credit reference; the deployed agent never sees the external LLM. SGCD improves over matched GRPO comparators on both AppWorld and $\tau^3$-airline.
Direct token-level self-distillation can backfire, but Sibling-Guided Credit Distillation redefines credit assignment to enhance long-horizon tool-use without amplifying harmful behaviors.
Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $\tau^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $\tau^3$-airline pass@1 $0.583 \to 0.602$.
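To make the last two steps of the pipeline concrete, the following is a minimal sketch of how bounded credit weights might reshape a broadcast GRPO advantage. It is not the paper's implementation: the abstract does not give the weighting formula, so the linear map from standardized divergence to a weight, the `scale` and `bound` parameters, the success `threshold`, and all function names are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def is_mixed(rewards, threshold=0.5):
    """Dynamic-sampling filter: keep a sibling group only if it contains
    both successful and failed rollouts (threshold is an assumption)."""
    flags = [r > threshold for r in rewards]
    return any(flags) and not all(flags)


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize outcome rewards within one
    group of sibling rollouts. Each value is broadcast to every token."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def credit_weights(divergences, scale=0.5, bound=0.5, eps=1e-6):
    """Map per-token teacher/student divergence to weights clipped to
    [1 - bound, 1 + bound].

    ASSUMPTION: a linear map of the standardized divergence; the paper's
    actual rule may differ. In an autodiff framework these weights would
    be detached so no gradient flows through them.
    """
    mu, sigma = mean(divergences), pstdev(divergences)
    weights = []
    for d in divergences:
        z = (d - mu) / (sigma + eps)
        weights.append(min(1.0 + bound, max(1.0 - bound, 1.0 + scale * z)))
    return weights


def reshape_token_advantages(advantage, divergences, **kwargs):
    """Scale one rollout's broadcast advantage token by token."""
    return [advantage * w for w in credit_weights(divergences, **kwargs)]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]  # hypothetical verifier outcomes
    if is_mixed(rewards):
        adv = grpo_advantages(rewards)[0]
        print(reshape_token_advantages(adv, [0.0, 0.0, 3.0]))
```

With `bound < 1` every weight stays positive, so the reshaping in this sketch can redistribute credit across tokens but cannot flip the sign of the verifier-derived advantage.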