NVIDIA; NUDT; Shanghai Innovation; SJTU; The Hunan Provincial Key Laboratory of Image; UCLA
Jun 3, 2026
arXiv:2606.05160

GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Tianyi Xie, Haotian Zhang, Haotian Zhang, Jinhyung Park, Zi Wang, Bowen Wen, Jiefeng Li, Xueting Li, Qingwei Ben, Haoyang Weng, Yufei Ye, David Minor, Tingwu Wang, Chenfanfu Jiang, Sanja Fidler, Jan Kautz, Linxi Fan, Yuke Zhu, Zhengyi Luo, Umar Iqbal, Ye Yuan

AI Summary

GRAIL is a digital generation pipeline that synthesizes humanoid loco-manipulation demonstrations from 3D assets and video priors, avoiding the physical setups that make teleoperation and motion capture hard to scale. Because it starts from fully specified 3D configurations, 4D recovery of human-object interactions is better conditioned, with reduced depth ambiguity and morphology mismatch in object tracking and human motion estimation. The pipeline produces over 20,000 sequences, and egocentric visual policies trained only on this data reach 84% real-world success on diverse object pick-up and 90% on stair-climbing on a Unitree G1 humanoid.

Key Contribution

Egocentric visual policies trained only on GRAIL-generated data reach 84% real-world success on diverse object pick-up and 90% success on stair-climbing when deployed on a Unitree G1 humanoid.

Abstract

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.
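
As a rough illustration of why known camera parameters and environment depth reduce depth ambiguity, the sketch below lifts a pixel to a metric 3D point with a standard pinhole model. It is not GRAIL's implementation: the `SceneConfig` class, its field names, and the `backproject` and `project` helpers are hypothetical, and the only assumption is that the intrinsics and a metric depth map are available before reconstruction, as the abstract states.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class SceneConfig:
    """Hypothetical container for quantities fixed before video generation."""

    intrinsics: np.ndarray  # 3x3 pinhole camera matrix K (assumed layout)
    depth_m: np.ndarray     # HxW environment depth map, in metres


def backproject(cfg: SceneConfig, u: int, v: int) -> np.ndarray:
    """Lift pixel (u, v) to a metric 3D point in the camera frame."""
    z = cfg.depth_m[v, u]  # depth is read from the known map, not estimated
    ray = np.linalg.inv(cfg.intrinsics) @ np.array([u, v, 1.0])
    return ray * z  # the ray has unit z, so scaling by z gives metric XYZ


def project(cfg: SceneConfig, point: np.ndarray) -> np.ndarray:
    """Project a camera-frame 3D point back to pixel coordinates."""
    uvw = cfg.intrinsics @ point
    return uvw[:2] / uvw[2]
```

With depth unknown, every point along the back-projected ray would explain the same pixel; reading the depth from the scene configuration selects a single metric point, which is the sense in which a fully specified setup conditions the recovery.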

Multimodal Models; Robotics & Embodied AI

Year: 2026

Venue: N/A
