ZeroWBC learns humanoid visuomotor control policies directly from human egocentric videos, bypassing the need for expensive robot teleoperation data. A fine-tuned VLM predicts future human motions from text instructions and egocentric vision, which are then retargeted to robot joints. Experiments on a Unitree G1 robot show ZeroWBC achieves more natural and versatile motions compared to baselines, establishing a scalable pipeline for humanoid whole-body control.
Skip the costly robot teleoperation data: ZeroWBC learns surprisingly natural humanoid control policies directly from human egocentric videos.
Achieving versatile and naturalistic whole-body control for humanoid scene interaction remains a significant challenge. While recent works have demonstrated autonomous humanoid interactive control, they are constrained to rigid locomotion patterns and lack the versatility to execute more human-like behaviors such as sitting or kicking. Moreover, collecting the necessary real-robot teleoperation data is prohibitively expensive and time-consuming. To address these limitations, we introduce ZeroWBC, a novel framework that learns a natural humanoid visuomotor control policy directly from human egocentric videos, eliminating the need for large-scale robot teleoperation data. Specifically, our approach first fine-tunes a Vision-Language Model (VLM) to predict future whole-body human motions from text instructions and egocentric visual context; these generated motions are then retargeted to robot joints and executed via our robust general motion tracking policy. Extensive experiments on the Unitree G1 humanoid robot demonstrate that our method outperforms baseline approaches in motion naturalness and versatility, establishing a pipeline that eliminates teleoperation data collection overhead and offering a scalable, efficient paradigm for general humanoid whole-body control.
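The abstract describes a three-stage pipeline: a fine-tuned VLM predicts future human motions, the motions are retargeted to robot joints, and a tracking policy executes them. A minimal sketch of that control flow is shown below; every function name, the toy 5-DoF pose, and the joint-limit values are illustrative assumptions, not the authors' actual API, and real retargeting must also account for the kinematic differences between human and robot morphology rather than simply clipping angles.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MotionFrame:
    """One predicted whole-body pose: joint angles in radians (toy 5-DoF)."""
    joint_angles: List[float]


def predict_future_motion(instruction: str,
                          ego_image: Optional[object]) -> List[MotionFrame]:
    """Stand-in for the fine-tuned VLM: maps a text instruction and
    egocentric visual context to a short sequence of future human poses.
    (Placeholder values; a real system would query the VLM here.)"""
    return [MotionFrame([0.1 * t, -0.2, 0.5, 1.8, -1.8]) for t in range(3)]


def retarget_to_robot(frame: MotionFrame,
                      lower: float = -1.5, upper: float = 1.5) -> MotionFrame:
    """Retarget a human pose to robot joints; here this is reduced to
    clipping each angle into assumed robot joint limits."""
    return MotionFrame([min(max(a, lower), upper) for a in frame.joint_angles])


def track(frames: List[MotionFrame]) -> List[MotionFrame]:
    """Stand-in for the general motion tracking policy: turns the
    retargeted reference trajectory into executable joint commands."""
    return [retarget_to_robot(f) for f in frames]


reference = predict_future_motion("sit on the chair", ego_image=None)
commands = track(reference)
```

The key design point the sketch reflects is the decoupling the paper claims: the VLM only produces human motion, so it can be trained on egocentric video alone, while the retargeting and tracking stages absorb the embodiment gap to the robot.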