Mar 11, 2026 | arXiv:2603.10675

Cybo-Waiter: A Physical Agentic Framework for Humanoid Whole-Body Locomotion-Manipulation

Penghua Ren, Haoyang Ge, Chuan Qi, Cong Huang, Hong Li, Jiang-Lin Zhao, Pei-I. Chi, Kaiyang Chen

AI Summary

This paper introduces "Cybo-Waiter," a humanoid agent framework designed to execute complex natural-language tasks in human environments by tightly integrating locomotion and manipulation. The framework uses a VLM to translate instructions into verifiable task programs with explicit preconditions and success conditions, and grounds the task-relevant entities in 3D using SAM3 and RGB-D data. A supervisor verifies subtask completion and provides feedback for replanning, while motion primitives are selected under reachability and balance constraints to execute each subtask.

Key Contribution

Achieves robust humanoid task execution in complex environments by turning high-level language instructions into verifiable, geometrically grounded task programs that can recover from failures.

Abstract

Robots are increasingly expected to execute open-ended natural-language requests in human environments, which demands reliable long-horizon execution under partial observability. This is especially challenging for humanoids because locomotion and manipulation are tightly coupled through stance, reachability, and balance. We present a humanoid agent framework that turns VLM plans into verifiable task programs and closes the loop with multi-object 3D geometric supervision. A VLM planner compiles each instruction into a typed JSON sequence of subtasks with explicit predicate-based preconditions and success conditions. Using SAM3 and RGB-D, we ground all task-relevant entities in 3D, estimate object centroids and extents, and evaluate predicates over stable frames to obtain condition-level diagnostics. The supervisor uses these diagnostics to verify subtask completion and to provide condition-level feedback for progression and replanning. We execute each subtask by coordinating humanoid locomotion and whole-body manipulation, selecting feasible motion primitives under reachability and balance constraints. Experiments on tabletop manipulation and long-horizon humanoid loco-manipulation tasks show improved robustness from multi-object grounding, temporal stability, and recovery-driven replanning.
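To make the idea of a typed JSON subtask with predicate-based conditions more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the field names (`name`, `preconditions`, `success_conditions`), the `on_top_of` predicate, its tolerance, and the `min_stable` frame count are all assumptions chosen for illustration. Each frame is assumed to map an object name to a 3D centroid and extent, standing in for the output of the SAM3 and RGB-D grounding step.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: these field names are illustrative, not the paper's.
@dataclass
class Condition:
    predicate: str   # name of a geometric predicate, e.g. "on_top_of"
    args: list       # names of the objects the predicate relates

@dataclass
class Subtask:
    name: str
    preconditions: list
    success_conditions: list

def parse_program(text):
    """Parse a JSON list of subtasks into typed objects."""
    program = []
    for item in json.loads(text):
        pre = [Condition(c["predicate"], c["args"]) for c in item["preconditions"]]
        post = [Condition(c["predicate"], c["args"]) for c in item["success_conditions"]]
        program.append(Subtask(item["name"], pre, post))
    return program

def on_top_of(frame, a, b, z_tol=0.03):
    """Assumed predicate: a rests on b's top surface, within b's footprint.

    A frame maps object name -> (centroid (x, y, z), extent (dx, dy, dz)), in metres.
    """
    if a not in frame or b not in frame:
        return False  # an undetected object cannot satisfy the predicate
    (ax, ay, az), (_, _, adz) = frame[a]
    (bx, by, bz), (bdx, bdy, bdz) = frame[b]
    within_xy = abs(ax - bx) <= bdx / 2 and abs(ay - by) <= bdy / 2
    gap = (az - adz / 2) - (bz + bdz / 2)  # bottom of a minus top of b
    return within_xy and abs(gap) <= z_tol

PREDICATES = {"on_top_of": on_top_of}

def evaluate(conditions, frames, min_stable=3):
    """Return one boolean per condition (condition-level diagnostics).

    A condition holds only if it is true in each of the last `min_stable` frames.
    """
    recent = frames[-min_stable:]
    diagnostics = {}
    for c in conditions:
        fn = PREDICATES[c.predicate]
        holds = len(recent) == min_stable and all(fn(f, *c.args) for f in recent)
        diagnostics[f"{c.predicate}({', '.join(c.args)})"] = holds
    return diagnostics

# Example program with a single hypothetical subtask.
PROGRAM_JSON = """
[{"name": "place_cup_on_tray",
  "preconditions": [],
  "success_conditions": [{"predicate": "on_top_of", "args": ["cup", "tray"]}]}]
"""
```

A supervisor in this style would call `evaluate(subtask.success_conditions, frames)` after each execution attempt and hand any failed entries back to the planner. Requiring several consecutive frames is one simple way to approximate temporal stability; the paper may define stable frames differently.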

Multimodal Models | Robotics & Embodied AI | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 30

Year: 2026

Venue: N/A
