CMU ML | Mar 4, 2026 | arXiv:2603.03768

Cognition to Control - Multi-Agent Learning for Human-Humanoid Collaborative Transport

Hao Zhang, Ding Zhao, H. Eric Tseng

AI Summary

The paper introduces Cognition-to-Control (C2C), a hierarchical framework for human-robot collaboration that explicitly integrates deliberative reasoning with continuous control. C2C uses a vision-language model for grounding, a decentralized MARL approach for long-horizon skill coordination modeled as a Markov potential game, and a whole-body controller for execution. Experiments on collaborative manipulation demonstrate that C2C achieves higher success rates and robustness compared to single-agent and end-to-end baselines, exhibiting stable coordination and emergent leader-follower dynamics.
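For readers unfamiliar with the term, the condition below is the standard definition of a Markov potential game from the multi-agent RL literature. It is offered as background, not as the paper's exact formulation: the symbols $V_i$, $\Phi$, $\pi_i$, and $\pi_{-i}$ are notation assumed here, and how C2C constructs its potential from task progress is not specified in this summary.

```latex
% Standard Markov potential game condition (assumed notation, not taken from the paper).
% V_i^{\pi}(s): value of agent i at state s under joint policy \pi = (\pi_i, \pi_{-i}).
% \Phi^{\pi}(s): a single potential function shared by all agents.
\[
  V_i^{(\pi_i,\,\pi_{-i})}(s) - V_i^{(\pi_i',\,\pi_{-i})}(s)
  \;=\;
  \Phi^{(\pi_i,\,\pi_{-i})}(s) - \Phi^{(\pi_i',\,\pi_{-i})}(s)
  \qquad
  \text{for every agent } i,\ \text{state } s,\ \text{and policies } \pi_i,\ \pi_i',\ \pi_{-i}.
\]
```

In words, any unilateral policy change shifts the deviating agent's value by exactly the change in the shared potential. A fully cooperative game in which all agents receive the same reward is a special case, with $\Phi = V_i$.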

Key Contribution

A hierarchical framework for human-robot collaboration that uses decentralized MARL to explicitly integrate high-level reasoning with low-latency control.

Abstract

Effective human-robot collaboration (HRC) requires translating high-level intent into contact-stable whole-body motion while continuously adapting to a human partner. Many vision-language-action (VLA) systems learn end-to-end mappings from observations and instructions to actions, but they often emphasize reactive (System 1-like) behavior and leave under-specified how sustained System 2-style deliberation can be integrated with reliable, low-latency continuous control. This gap is acute in multi-agent HRC, where long-horizon coordination decisions and physical execution must co-evolve under contact, feasibility, and safety constraints. We address this limitation with cognition-to-control (C2C), a three-layer hierarchy that makes the deliberation-to-control pathway explicit: (i) a VLM-based grounding layer that maintains persistent scene referents and infers embodiment-aware affordances/constraints; (ii) a deliberative skill/coordination layer (the System 2 core) that optimizes long-horizon skill choices and sequences under human-robot coupling via decentralized MARL cast as a Markov potential game with a shared potential encoding task progress; and (iii) a whole-body control layer that executes the selected skills at high frequency while enforcing kinematic/dynamic feasibility and contact stability. The deliberative layer is realized as a residual policy relative to a nominal controller, internalizing partner dynamics without explicit role assignment. Experiments on collaborative manipulation tasks show higher success and robustness than single-agent and end-to-end baselines, with stable coordination and emergent leader-follower behaviors.
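The idea of a residual policy relative to a nominal controller can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in rather than the authors' implementation: the observation layout, the proportional nominal controller, the compliance-style residual (a placeholder for a trained MARL policy), and the clip bound `max_residual` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class ResidualPolicy:
    """Command = nominal command + bounded correction (hypothetical structure)."""

    nominal: Callable[[Sequence[float]], Vector]
    residual: Callable[[Sequence[float]], Vector]
    max_residual: float = 0.2  # assumed clip bound, not a value from the paper

    def act(self, obs: Sequence[float]) -> Vector:
        base = self.nominal(obs)
        delta = self.residual(obs)
        # Clip each correction term so the residual can only nudge the
        # nominal command, never override it.
        clipped = [max(-self.max_residual, min(self.max_residual, d)) for d in delta]
        return [b + c for b, c in zip(base, clipped)]


def nominal_controller(obs: Sequence[float]) -> Vector:
    # Assumed observation layout: [goal_dx, goal_dy, partner_fx, partner_fy].
    # Hypothetical nominal behavior: proportional velocity command toward the goal.
    gain = 0.5
    return [gain * obs[0], gain * obs[1]]


def learned_residual(obs: Sequence[float]) -> Vector:
    # Placeholder for a trained policy: yield in the direction of the
    # partner's applied force.
    compliance = 0.1
    return [compliance * obs[2], compliance * obs[3]]


policy = ResidualPolicy(nominal_controller, learned_residual)
# Goal is 1 m ahead in x; the partner pushes with 5 N in y.
command = policy.act([1.0, 0.0, 0.0, 5.0])
print(command)
```

In this toy setup the nominal term moves the object toward the goal while the clipped residual adds a small sideways adjustment in response to the partner's force, which is one simple way a correction term could absorb partner behavior without an explicit leader or follower label.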

Multimodal Models | Robotics & Embodied AI | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
