This paper introduces PACT, a Privileged trAce Co-Training framework that improves multi-turn tool-use agents by using expert traces as training-time optimization signals rather than rollout-time hints. It combines a trace-conditioned reinforcement learning surrogate with a component-aware supervised fine-tuning loss, so the agent benefits from expert guidance while rollout generation stays prompt-only. Experiments show that PACT improves over strong supervised fine-tuning and reinforcement learning baselines on multi-turn tool-use benchmarks.
Expert traces can improve multi-turn tool-use agents without constraining rollout generation, yielding consistent gains over strong SFT- and RL-based baselines.
Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging: reinforcement learning matches the prompt-only inference setting but often suffers from sparse rewards and weak credit assignment, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To address this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.
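
The abstract does not give the exact form of the objective, so the following is only a minimal sketch of how the three training signals it names could be combined into one scalar loss. Everything specific here is an assumption made for illustration: the linear annealing schedule, the per-component weights, the additive combination, the `anchor_coef` value, and all function names are hypothetical and may differ from the paper's formulation.

```python
def annealed_weight(step, total_steps, start=1.0, end=0.0):
    """Linearly interpolate the SFT weight from `start` to `end`.

    Assumption: the abstract only says the SFT strength is annealed;
    a linear schedule is one simple choice, not the paper's stated one.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def component_sft_loss(token_nll, components, weights):
    """Weighted mean negative log-likelihood over expert-trace tokens.

    `components` tags each token (e.g. "reasoning", "tool_call").
    Tokens whose tag is missing from `weights` get weight 0 and are skipped.
    """
    total, norm = 0.0, 0.0
    for nll, comp in zip(token_nll, components):
        w = weights.get(comp, 0.0)
        total += w * nll
        norm += w
    return total / norm if norm > 0 else 0.0


def combined_loss(rl_surrogate, sft, anchor, step, total_steps, anchor_coef=0.1):
    """Hypothetical additive combination of the three signals.

    rl_surrogate: trace-conditioned RL surrogate loss on prompt-only rollouts
    sft:          component-aware SFT loss on the expert trace
    anchor:       prompt-only anchoring penalty
    """
    lam = annealed_weight(step, total_steps)
    return rl_surrogate + lam * sft + anchor_coef * anchor


# Example: per-token NLLs for a short expert trace with tagged components.
sft = component_sft_loss(
    token_nll=[1.0, 3.0, 5.0],
    components=["reasoning", "tool_call", "observation"],
    weights={"reasoning": 1.0, "tool_call": 1.0},
)
loss_early = combined_loss(1.0, sft, 0.5, step=0, total_steps=100)
loss_late = combined_loss(1.0, sft, 0.5, step=100, total_steps=100)
```

In this sketch the SFT term dominates early in training and fades to zero by the final step, while the RL surrogate and the anchoring penalty stay active throughout. This is one way to read "annealed strength", not a confirmed detail of PACT.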