CMU ML | Mar 4, 2026 | arXiv:2603.03741

HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration

AI Summary

The paper addresses the challenge of training robots to collaborate with humans in human-robot collaboration (HRC), where heterogeneity between agents destabilizes decentralized multi-agent reinforcement learning (MARL). The authors introduce HALyPO, a Lyapunov-based policy optimization method that enforces stability in policy-parameter space by requiring a per-step decrease in a parameter-space disagreement metric. By rectifying decentralized gradients via optimal quadratic projections, HALyPO achieves monotonic contraction of the rationality gap, the mismatch between decentralized best-response dynamics and centralized cooperative ascent. The authors report improved generalization and robustness in both simulations and real-world humanoid-robot experiments.

Key Contribution

HALyPO stabilizes human-robot collaboration by directly certifying the convergence of decentralized policy learning in parameter space, sidestepping the oscillations that plague standard MARL approaches.
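The oscillation problem mentioned above can be illustrated on a standard toy differentiable game. The sketch below is not the paper's setting: it assumes a two-player bilinear game f(x, y) = x * y, in which one player minimizes f over x and the other maximizes it over y, and the function name and parameters are hypothetical. With independent, simultaneous gradient updates, the iterates spiral away from the equilibrium at (0, 0) instead of converging to it.

```python
import numpy as np


def simultaneous_gradient_steps(x0, y0, lr, n_steps):
    """Run independent gradient updates on the toy game f(x, y) = x * y.

    Player 1 minimizes f over x; player 2 maximizes f over y.
    Returns the distance from the equilibrium (0, 0) after each step.
    This is a textbook toy game, assumed here for illustration only.
    """
    x, y = float(x0), float(y0)
    norms = [np.hypot(x, y)]
    for _ in range(n_steps):
        grad_x, grad_y = y, x  # df/dx = y, df/dy = x
        # Each player updates using only its own gradient, at the same time.
        x, y = x - lr * grad_x, y + lr * grad_y
        norms.append(np.hypot(x, y))
    return np.array(norms)


norms = simultaneous_gradient_steps(1.0, 0.0, lr=0.1, n_steps=100)
print(norms[0], norms[-1])  # the distance from equilibrium only grows
```

In this toy game each step multiplies the squared distance from the equilibrium by (1 + lr**2), so no choice of positive step size makes the independent updates converge.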

Abstract

To improve generalization and resilience in human-robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process, a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.
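The abstract does not spell out the disagreement metric or the exact form of the projection, so the following is only a generic sketch of what a quadratic projection onto a Lyapunov decrease condition can look like. It assumes a differentiable Lyapunov candidate V with gradient v at the current parameters, an update of the form theta + lr * d, and a first-order decrease condition v @ d <= -margin. The function name, the margin parameter, and the single half-space constraint are all assumptions, not the authors' formulation.

```python
import numpy as np


def project_to_decrease(g, v, margin=0.0):
    """Return the direction d closest to g (in Euclidean norm) with v @ d <= -margin.

    g      : raw update direction, e.g. stacked per-agent gradients (hypothetical)
    v      : gradient of an assumed Lyapunov candidate V at the current parameters
    margin : required first-order decrease rate of V (hypothetical parameter)

    This is the closed-form solution of
        min_d ||d - g||^2   subject to   v @ d <= -margin,
    i.e. a projection onto a half-space.
    """
    g = np.asarray(g, dtype=float)
    v = np.asarray(v, dtype=float)
    vv = v @ v
    if vv == 0.0:
        # V is stationary here, so a first-order constraint cannot be enforced.
        return g.copy()
    violation = v @ g + margin
    if violation <= 0.0:
        # g already satisfies the decrease condition; leave it unchanged.
        return g.copy()
    # Remove just enough of the component along v to meet the constraint.
    return g - (violation / vv) * v


v = np.array([1.0, 0.0])
d = project_to_decrease(np.array([1.0, 2.0]), v, margin=0.5)
print(d, v @ d)  # the projected direction meets the constraint with equality
```

A direction that already decreases V by the required margin passes through untouched, so the correction is applied only when the raw decentralized update would violate the condition.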

RLHF & Preference Learning | Robotics & Embodied AI | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
