ASU, Georgia Tech, Ohio State, UPenn | Jun 15, 2026 | arXiv:2606.16215

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

Zhenbang Du, Jun Luo, Zhiwei Zheng, Xiangchi Yuan, Kejing Xia, Dachuan Shi, Qirui Jin, Qijia He, Shaofeng Zou, Yingbin Liang, Wenke Lee

AI Summary

This paper introduces PACT, a Privileged Trace Co-Training framework that improves multi-turn tool-use agents by using expert traces as training-time optimization signals rather than rollout-time hints. It combines a trace-conditioned reinforcement learning surrogate with a component-aware supervised fine-tuning loss, so the agent benefits from expert guidance while rollout generation remains prompt-only. Experiments on FTRL, BFCL, and ToolHop show that PACT improves over strong SFT- and RL-based baselines.

Key Contribution

Expert traces can improve multi-turn tool-use agents when used only as training-time optimization signals, leaving rollout generation prompt-only and yielding consistent gains over strong SFT- and RL-based baselines.

Abstract

Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging: reinforcement learning matches the prompt-only inference setting but often suffers from sparse rewards and weak credit assignment, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To address this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.
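The abstract names three training-time signals (a trace-conditioned RL surrogate, a component-aware SFT loss with annealed strength, and prompt-only anchoring) without giving their exact form. The following is only a minimal sketch of how such terms might be combined into one scalar objective. Everything specific here is an assumption rather than the paper's formulation: the linear annealing schedule, the additive combination, the default weights, and the hypothetical names `annealed_weight`, `LossTerms`, and `combined_loss`.

```python
from dataclasses import dataclass


def annealed_weight(step: int, total_steps: int,
                    start: float = 1.0, end: float = 0.0) -> float:
    """Linearly move a loss weight from `start` to `end`.

    The linear schedule is an assumption; the abstract only says the
    SFT supervision has "annealed strength".
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (end - start) * frac


@dataclass
class LossTerms:
    """Already-computed scalar losses for one batch (hypothetical container)."""
    rl_surrogate: float   # trace-conditioned RL surrogate on prompt-only rollouts
    sft_reasoning: float  # SFT loss on reasoning prefixes
    sft_tool_call: float  # SFT loss on tool calls
    anchor: float         # prompt-only anchoring penalty


def combined_loss(terms: LossTerms, step: int, total_steps: int,
                  reasoning_weight: float = 0.5,
                  tool_call_weight: float = 1.0,
                  anchor_weight: float = 0.1) -> float:
    """Add the terms, scaling the SFT part by the annealed weight.

    The per-component weights and the additive form are illustrative
    assumptions, not values from the paper.
    """
    sft_scale = annealed_weight(step, total_steps)
    sft = (reasoning_weight * terms.sft_reasoning
           + tool_call_weight * terms.sft_tool_call)
    return terms.rl_surrogate + sft_scale * sft + anchor_weight * terms.anchor
```

Under these assumptions, the SFT terms dominate early in training and fade out by the final step, leaving the RL surrogate and the anchoring penalty; separate weights for reasoning prefixes and tool calls are one simple way to make the supervision "component-aware".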

RLHF & Preference Learning, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
