This paper introduces EgoPriMo, a framework that learns egocentric motion priors for humanoid robots from human demonstrations. Given egocentric observations and a text prompt, it reconstructs, generates, and forecasts full-body motion. A Triple-stream DiT jointly models body dynamics, egocentric visual context, and text, so language serves as a high-level control signal rather than a complete motion specification. On Nymeria and EgoExo4D, a single checkpoint improves egocentric motion generation over UniEgoMotion while also supporting reconstruction and forecasting.
EgoPriMo enables humanoid robots to generate and forecast complex motions interactively using just egocentric observations and high-level language prompts.
Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Motion tracking reproduces specified trajectories, and humanoid vision-language-action systems provide semantic interfaces, but neither offers a scalable and interactive prior for broad full-body behavior. We introduce EgoPriMo (Egocentric Motion Prior for Humanoid Robots), a unified framework that learns such priors from egocentric human demonstrations. Given egocentric observations and a text prompt, EgoPriMo reconstructs, generates, and forecasts SMPL-based full-body motion. Language is used as a high-level control signal rather than a complete motion specification. At the core of EgoPriMo is a Triple-stream DiT that jointly models body dynamics, egocentric visual context, and text; task-conditioning masks route different tasks and missing-modality data through the same checkpoint. Experiments on Nymeria and EgoExo4D show that one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting; the generated SMPL motions can also be executed by a Unitree humanoid controller. These results indicate a practical path from scalable egocentric observations to generalizable and interactive humanoid motion priors.
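The abstract says that task-conditioning masks route different tasks and missing-modality data through one checkpoint, but it does not give the mask layout. The sketch below is only an illustration of the general idea, not the authors' implementation. The function names (`conditioning_masks`, `apply_mask`), the per-task visibility rules, and the use of a single null token for hidden inputs are all assumptions made for this example.

```python
import numpy as np

TASKS = ("reconstruction", "forecasting", "generation")

def conditioning_masks(task, num_frames, num_past=0, has_text=True):
    """Build boolean visibility masks (True = token is given as conditioning).

    Hypothetical layout, assumed for illustration only:
      reconstruction: egocentric frames are visible over the whole window
      forecasting:    only the first `num_past` frames are visible
      generation:     only the first frame is visible
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    vision = np.zeros(num_frames, dtype=bool)
    if task == "reconstruction":
        vision[:] = True
    elif task == "forecasting":
        vision[:num_past] = True
    else:  # generation
        vision[:1] = True
    # A sample without a caption simply hides the text stream.
    return {"vision": vision, "text": bool(has_text)}

def apply_mask(tokens, mask, null_token):
    """Replace hidden per-frame tokens with a shared null token.

    tokens:     (num_frames, dim) array of per-frame features
    mask:       (num_frames,) boolean array, True where the token is visible
    null_token: (dim,) placeholder for hidden tokens (assumed, e.g. learned)
    """
    return np.where(mask[:, None], tokens, null_token)

# Usage: the same routine serves every task; only the mask changes.
frames = np.ones((8, 4))   # stand-in for egocentric visual features
null = np.zeros(4)         # stand-in for a null token
masks = conditioning_masks("forecasting", num_frames=8, num_past=3)
visible = apply_mask(frames, masks["vision"], null)
```

Under these assumptions, switching tasks or dropping a modality changes only the mask, which is what would let a single set of weights cover reconstruction, forecasting, and generation.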