Mar 3, 2026 · arXiv:2603.02604

Heterogeneous Agent Collaborative Reinforcement Learning

Zhixia Zhang, Zixuan Huang, X. Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban

AI Summary

Heterogeneous Agent Collaborative Reinforcement Learning (HACRL) is a training paradigm in which heterogeneous agents share verified rollouts during training while operating independently at inference time, improving sample efficiency over isolated on-policy optimization. Its algorithm, HACPO, introduces four mechanisms to mitigate capability discrepancies and policy distribution shifts, with theoretical guarantees on unbiased advantage estimation. Across reasoning benchmarks, HACPO outperforms GSPO by an average of 3.3% while using only half the rollout cost.

Key Contribution

Heterogeneous agents can improve one another during RL training by sharing rollouts, without coordinated deployment at inference time, outperforming GSPO by an average of 3.3% at half the rollout cost.

Abstract

We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
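The abstract does not spell out HACPO's four mechanisms, so the following is only a minimal sketch of the general rollout-sharing idea, not the paper's algorithm. It assumes two hypothetical agents answer the same prompt, a verifier has already scored each response, and advantages are computed by a simple group normalization over the pooled rewards (a GRPO-style stand-in that does not correct for capability gaps or distribution shift). The names `Rollout` and `pooled_advantages`, the agent labels, and the toy prompt are all illustrative assumptions.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    agent: str      # which agent generated the response (hypothetical label)
    prompt: str
    response: str
    reward: float   # verifier score, e.g. 1.0 if the answer checks out


def pooled_advantages(rollouts, eps=1e-6):
    """Normalize rewards over rollouts pooled from all agents for one prompt.

    Assumption: plain group normalization, used here only for illustration.
    """
    rewards = [r.reward for r in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r.reward - mu) / (sigma + eps) for r in rollouts]


# Two hypothetical agents answer the same prompt; rewards come from a verifier.
pool = [
    Rollout("agent_a", "2+2=?", "4", 1.0),
    Rollout("agent_a", "2+2=?", "5", 0.0),
    Rollout("agent_b", "2+2=?", "4", 1.0),
    Rollout("agent_b", "2+2=?", "3", 0.0),
]
advantages = pooled_advantages(pool)
```

In this sketch each agent would then update on the full pool, so every verified rollout is used by both agents instead of only by the agent that produced it. Handling the mismatch between an agent's own policy and the policy that generated a shared rollout is the part the abstract attributes to HACPO's tailored mechanisms, and it is deliberately left out here.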

Robotics & Embodied AI · Tool Use & Agents · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 59

Year: 2026

Venue: N/A
