Tsinghua AIAI Laboratory | CUHK | Faculty of Computing | ZJU | May 21, 2026 | arXiv:2605.22272

Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

Jiahe Chen, ZiRui Wang, Feiyu Jia, Xiao Chen, Xiaojie Niu, Weishuai Zeng, Tianfan Xue, Xiaowei Zhou, Jiangmiao Pang, Jingbo Wang

AI Summary

Imagine2Real addresses the challenge of limited 3D data for Humanoid-Object Interaction (HOI) by leveraging video generative priors without relying on explicit geometric models. The method represents robot and object motions as unified 4D point trajectories tracked using sparse critical points, avoiding complex retargeting. By using the latent space of a Behavior Foundation Model (BFM) as the tracker's search domain and a progressive training strategy, Imagine2Real achieves robust zero-shot physical deployment.

Key Contribution

Robots can now learn flexible, geometry-free interactions with objects directly from video, sidestepping the need for laborious 3D modeling or complex retargeting.

Abstract

Whole-body Humanoid-Object Interaction (HOI) is bottlenecked by the scarcity of high-fidelity 3D data. While video generative priors offer a promising alternative, existing methods suffer from two problems: Representation Misalignment, due to their reliance on geometric priors (e.g., explicit CAD models), and Retargeting Complexity, arising from intensive morphing and morphological mismatch. We propose Imagine2Real, a zero-shot HOI framework for flexible, geometry-free interaction. To resolve misalignment, we formulate robot and object motions as unified 4D point trajectories. To overcome retargeting complexity, our Keypoints Tracker tracks only sparse critical points (base, hands, and object), entirely bypassing the error-amplifying retargeting process. To maintain natural gaits despite these sparse signals, we utilize the latent space of a Behavior Foundation Model (BFM) as the tracker's search domain. Using a progressive training strategy, Imagine2Real learns robust behaviors with simple tracking rewards, enabling zero-shot physical deployment within a motion capture (mocap) system.
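To make the idea of a "simple tracking reward" over sparse keypoints concrete, here is a minimal sketch. It is not the paper's reward: the exponential kernel, the length scale sigma, the keypoint names, and the function names tracking_reward and trajectory_reward are all assumptions made for illustration. The abstract states only that base, hands, and object points are tracked along 4D (3D position plus time) trajectories.

```python
import math

# Hypothetical keypoint names, following the abstract's "base, hands, and object".
KEYPOINTS = ("base", "left_hand", "right_hand", "object")


def tracking_reward(current, reference, sigma=0.25):
    """Mean exponential-kernel reward over sparse keypoints for one time step.

    current, reference: dict mapping keypoint name -> (x, y, z) in metres.
    sigma: assumed length scale in metres; not a value taken from the paper.
    """
    total = 0.0
    for name in KEYPOINTS:
        # Squared Euclidean distance between tracked and reference point.
        sq_err = sum((c - r) ** 2 for c, r in zip(current[name], reference[name]))
        total += math.exp(-sq_err / sigma ** 2)
    return total / len(KEYPOINTS)


def trajectory_reward(states, reference_trajectory, sigma=0.25):
    """Average the per-step reward along a trajectory (a list of frames)."""
    steps = list(zip(states, reference_trajectory))
    if not steps:
        raise ValueError("trajectory must contain at least one frame")
    return sum(tracking_reward(s, r, sigma) for s, r in steps) / len(steps)
```

Under these assumptions, a frame that matches its reference exactly scores 1.0, and the score decays smoothly toward 0 as any keypoint drifts away. Because only four points are constrained, such a reward says nothing about the rest of the body, which is consistent with the abstract's motivation for restricting the search to the BFM latent space.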

Computer Vision | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
