The paper introduces Being-H0, a Vision-Language-Action (VLA) model pretrained on a large-scale dataset of human videos to improve dexterity and generalization in manipulation tasks. The authors address the limitations of existing VLAs by drawing on human hand demonstrations from web data, and they propose a physical instruction tuning paradigm that combines VLA pretraining, physical space alignment, and post-training adaptation for robotic tasks. The model incorporates a part-level motion tokenization method for precise hand-trajectory modeling and demonstrates strong performance in hand motion generation, instruction following, and real-world robotic manipulation.
Forget synthetic data and limited teleoperation: Being-H0 leverages the dexterity and scalability of human hand videos for VLA pretraining, unlocking superior performance in complex manipulation tasks.
We introduce Being-H0, a dexterous Vision-Language-Action (VLA) model trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks that require high dexterity and generalize poorly to novel scenarios and tasks, primarily because they rely on synthetic data with significant sim-to-real gaps or on teleoperated demonstrations that lack scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy, modeling precise hand trajectories for action learning. To support this paradigm, we develop a comprehensive data curation pipeline that integrates heterogeneous sources, including motion capture, VR, and RGB-only videos, into a large-scale dataset with millions of motion-based instructional instances. We empirically show that Being-H0 excels at hand motion generation and instruction following, and that it scales well with model and data size. Importantly, Being-H0 delivers the expected gains in real-world robotic manipulation as physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.
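The abstract leaves the tokenizer's internals unspecified. As a rough mental model, a part-level motion tokenizer can be sketched as a VQ-VAE with a separate codebook per hand part; the MANO-style parameter split (6-D wrist, 45-D finger articulation), layer widths, and codebook size below are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of part-level motion tokenization (VQ-VAE style).
# The wrist/finger split and all dimensions are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartTokenizer(nn.Module):
    """Quantizes one hand part's trajectory into discrete motion tokens."""
    def __init__(self, dim_in, dim_code=128, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 256), nn.GELU(),
                                     nn.Linear(256, dim_code))
        self.decoder = nn.Sequential(nn.Linear(dim_code, 256), nn.GELU(),
                                     nn.Linear(256, dim_in))
        self.codebook = nn.Embedding(codebook_size, dim_code)

    def forward(self, x):                     # x: (batch, frames, dim_in)
        z = self.encoder(x)
        # Squared L2 distance from each frame embedding to every code.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        idx = d.argmin(-1)                    # (batch, frames) token ids
        zq = self.codebook(idx)
        # Straight-through estimator so gradients reach the encoder.
        recon = self.decoder(z + (zq - z).detach())
        # Standard VQ-VAE objective: reconstruction + codebook + commitment.
        loss = (F.mse_loss(recon, x)
                + F.mse_loss(zq, z.detach())
                + 0.25 * F.mse_loss(z, zq.detach()))
        return idx, recon, loss

class PartLevelMotionTokenizer(nn.Module):
    """Separate codebooks for coarse wrist motion and fine finger articulation."""
    def __init__(self):
        super().__init__()
        self.wrist = PartTokenizer(dim_in=6)      # translation + rotation
        self.fingers = PartTokenizer(dim_in=45)   # per-joint finger pose

    def forward(self, wrist_traj, finger_traj):
        w_idx, _, w_loss = self.wrist(wrist_traj)
        f_idx, _, f_loss = self.fingers(finger_traj)
        return (w_idx, f_idx), w_loss + f_loss

tok = PartLevelMotionTokenizer()
tokens, loss = tok(torch.randn(2, 64, 6), torch.randn(2, 64, 45))
print(tokens[0].shape, tokens[1].shape, loss.item())   # (2, 64) ids per part
```

Giving each part its own codebook lets the tokenizer spend capacity independently on the global wrist trajectory and on fine finger articulation, which is one plausible route to the millimeter-level reconstruction accuracy the authors report.

The abstract likewise does not say how motion tokens enter the model, but a common design for token-based action models, and one plausible reading of physical instruction tuning, is to fold the discrete motion codes into the language model's vocabulary and pretrain with ordinary next-token prediction over motion-based instructional instances. The gpt2 backbone and special-token names below are stand-ins, not the paper's actual setup.

```python
# Hypothetical sketch: discrete motion tokens added to a causal LM's
# vocabulary so instructions and hand trajectories share one token stream.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in backbone
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Reserve one text-side token per codebook entry for each hand part.
motion_tokens = ([f"<wrist_{i}>" for i in range(1024)]
                 + [f"<finger_{i}>" for i in range(1024)])
tokenizer.add_tokens(motion_tokens)
model.resize_token_embeddings(len(tokenizer))

# A motion-based instructional instance: instruction text followed by the
# tokenized hand trajectory, trained with ordinary next-token prediction.
sample = "Pick up the mug. <wrist_17><finger_402><wrist_18><finger_355>"
batch = tokenizer(sample, return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
print(out.loss)
```

Under this reading, post-training adaptation would keep the same unified token interface while mapping it onto robot action streams, though the abstract does not specify this mechanism.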