CAS · Apr 15, 2026 · arXiv:2604.14054

$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao

AI Summary

This paper introduces Privileged Information Self-Play ($\pi$-Play), a multi-agent self-play framework that leverages the question construction path (QCP) generated during self-play as privileged information for self-distillation. An examiner generates tasks and their QCPs, which the teacher model uses to supervise a student, transforming sparse-reward self-play into a dense-feedback loop. Experiments demonstrate that $\pi$-Play outperforms fully supervised search agents and improves evolutionary efficiency by 2-3x compared to conventional self-play, all without external data.

Key Contribution

Self-play becomes 2-3x more efficient than conventional self-play when the "question construction path" it generates is reused as privileged information for self-distillation.

Abstract

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($\pi$-Play), a multi-agent self-evolution framework. In $\pi$-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free $\pi$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play.
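The contrast between sparse outcome rewards and dense teacher feedback can be illustrated with a toy sketch. The abstract does not specify the distillation objective, so the per-token KL divergence used here is an assumption, and all names (`dense_distillation_signal`, `sparse_outcome_signal`, the toy logits) are hypothetical. The teacher logits stand in for a model that sees the question plus its QCP; the student logits stand in for the same model seeing only the question.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dense_distillation_signal(teacher_logits, student_logits):
    """Assumed objective: per-token KL(teacher || student), one value per step."""
    return [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]

def sparse_outcome_signal(num_tokens, answer_correct):
    """Conventional outcome reward: a single scalar at the final step only."""
    return [0.0] * (num_tokens - 1) + [1.0 if answer_correct else 0.0]

# Toy data (hypothetical): 3 generation steps over a 3-token vocabulary.
# Teacher = model conditioned on question + QCP (privileged context).
# Student = same model conditioned on the question alone.
teacher_logits = [[2.0, 0.1, 0.1], [0.1, 2.5, 0.1], [0.1, 0.1, 3.0]]
student_logits = [[1.0, 0.8, 0.2], [0.5, 0.9, 0.6], [0.3, 0.2, 1.1]]

dense = dense_distillation_signal(teacher_logits, student_logits)
sparse = sparse_outcome_signal(num_tokens=3, answer_correct=True)

print("dense per-token feedback:", [round(v, 4) for v in dense])
print("sparse outcome feedback: ", sparse)
```

In this sketch every generation step receives a nonzero training signal under distillation, whereas the outcome reward is informative only at the last step. That is the sense in which privileged context could turn a sparse-reward loop into a dense-feedback one.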

Data Curation & Synthetic Data · Tool Use & Agents · Training Efficiency & Optimization · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
