HKU | Max Planck | Tampere | ZJU | Feb 26, 2026 | arXiv:2602.23205

EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, C. Theobalt, Taku Komura

AI Summary

The paper introduces EmbodMocap, a portable pipeline using dual RGB-D sequences from two iPhones to jointly reconstruct humans and scenes in a unified metric world coordinate frame, enabling large-scale capture of scene-conditioned human motion data in the wild. This approach mitigates depth ambiguity and achieves superior alignment and reconstruction compared to single-view methods. The collected data is then used to fine-tune feedforward models for monocular human-scene reconstruction, improve physics-based character animation, and train a humanoid robot for motion imitation via sim-to-real reinforcement learning.

Key Contribution

Unlock real-world embodied AI: EmbodMocap's affordable dual-iPhone setup captures scene-aware human motion data, enabling robots to learn from human actions in everyday environments.

Abstract

Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and action. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Evaluating against optical capture ground truth, we demonstrate that the dual-view setting effectively mitigates depth ambiguity, achieving superior alignment and reconstruction performance over single-iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space-aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
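The idea of placing two RGB-D views in one metric world frame can be illustrated with a small sketch. This is not the paper's calibration procedure, which the abstract does not detail. It only assumes a standard pinhole camera model and that each phone's camera-to-world pose is already known. The intrinsics, poses, pixel coordinates, and function names below are hypothetical values chosen for illustration.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with metric depth into camera coordinates (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def to_world(p_cam: Vec3, cam_to_world: Sequence[Sequence[float]]) -> Vec3:
    """Apply a 4x4 rigid camera-to-world transform to a camera-space point."""
    x, y, z = p_cam
    wx, wy, wz = (row[0] * x + row[1] * y + row[2] * z + row[3]
                  for row in cam_to_world[:3])
    return (wx, wy, wz)


# Hypothetical shared intrinsics for both phones (assumption, not from the paper).
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# Hypothetical poses: camera A at the world origin, camera B shifted 1 m along x.
POSE_A = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
POSE_B = [[1.0, 0.0, 0.0, 1.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]

# The same physical point seen by each camera at a different pixel, depth 2 m.
point_from_a = to_world(backproject(445.0, 240.0, 2.0, FX, FY, CX, CY), POSE_A)
point_from_b = to_world(backproject(195.0, 240.0, 2.0, FX, FY, CX, CY), POSE_B)
```

With consistent poses, both observations map to the same world-space point. Because each view supplies its own metric depth along a different ray, a second view constrains the position along the first camera's depth axis, which is the intuition behind the depth-ambiguity claim in the abstract.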

Computer Vision | Data Curation & Synthetic Data | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 72

Year: 2026

Venue: N/A
