The paper introduces EgoHumanoid, a framework for training vision-language-action policies for humanoid loco-manipulation by leveraging abundant egocentric human demonstrations alongside a small amount of robot data. To address the embodiment gap between humans and robots, the authors propose a systematic alignment pipeline that includes view alignment, which reduces visual domain discrepancies, and action alignment, which maps human motions into a kinematically feasible action space. Real-world experiments demonstrate that incorporating robot-free egocentric data improves performance over robot-only baselines by 51%, with the largest gains in unseen environments.
Humanoid robots can now learn complex loco-manipulation skills in diverse real-world environments by watching humans, achieving a 51% performance boost over robot-only training.
Human demonstrations offer rich environmental diversity and scale naturally, making them an appealing alternative to robot teleoperation. While this paradigm has advanced robot-arm manipulation, its potential for the more challenging, data-hungry problem of humanoid loco-manipulation remains largely unexplored. We present EgoHumanoid, the first framework to co-train a vision-language-action policy on abundant egocentric human demonstrations together with a limited amount of robot data, enabling humanoids to perform loco-manipulation across diverse real-world environments. To bridge the embodiment gap between humans and robots, including discrepancies in physical morphology and viewpoint, we introduce a systematic alignment pipeline spanning from hardware design to data processing. We develop a portable system for scalable human data collection and establish practical collection protocols that improve transferability. At the core of our human-to-humanoid alignment pipeline lie two key components: view alignment, which reduces visual domain discrepancies caused by variation in camera height and perspective, and action alignment, which maps human motions into a unified, kinematically feasible action space for humanoid control. Extensive real-world experiments demonstrate that incorporating robot-free egocentric data outperforms robot-only baselines by 51%, particularly in unseen environments. Our analysis further reveals which behaviors transfer effectively and the potential of scaling human data.
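The abstract does not specify how view alignment is implemented; as an illustration only, the sketch below shows one common way such a correction can be approximated: warping an egocentric human frame toward the robot camera's viewpoint with a rotation-only perspective homography. The function names (`pitch_homography`, `align_view`) and the assumption that the viewpoint gap is dominated by a pitch difference correctable as H = K R K⁻¹ are hypothetical, not taken from the paper.

```python
import numpy as np
import cv2


def pitch_homography(K: np.ndarray, pitch_deg: float) -> np.ndarray:
    """Rotation-only homography H = K @ R @ K^-1 for a camera pitch change.

    K is the 3x3 camera intrinsic matrix; pitch_deg is the pitch offset
    between the human head-mounted camera and the robot head camera
    (a hypothetical simplification of the view gap).
    """
    t = np.deg2rad(pitch_deg)
    # Rotation about the camera x-axis (pitch).
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(t), -np.sin(t)],
                  [0.0, np.sin(t), np.cos(t)]])
    return K @ R @ np.linalg.inv(K)


def align_view(img: np.ndarray, K: np.ndarray, pitch_deg: float) -> np.ndarray:
    """Warp a human egocentric frame toward the assumed robot viewpoint."""
    H = pitch_homography(K, pitch_deg)
    h, w = img.shape[:2]
    return cv2.warpPerspective(img, H, (w, h))
```

A rotation-only homography cannot account for the camera-height (translation) component of the gap, which generally requires scene geometry or a learned correction; this sketch only conveys the flavor of reducing perspective mismatch before co-training.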