Apr 21, 2026 | arXiv:2604.19105

EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

Ruibing Hou, Mingyue Zhou, Yuwei Gui, Mingshuang Luo, Bingpeng Ma, Shiguang Shan, Xilin Chen

AI Summary

The paper tackles the problem of generating 3D human motion from egocentric vision and language inputs, identifying a "reasoning-generation entanglement" that hinders performance. To address this, the authors propose EgoMotion, a hierarchical framework that decouples cognitive reasoning (a VLM maps the inputs to discrete motion primitives) from motion generation (a diffusion model synthesizes the trajectory). Experiments show that EgoMotion achieves state-of-the-art performance in generating semantically grounded and kinematically plausible egocentric motion.

Key Contribution

Decoupling high-level reasoning from low-level motor control in egocentric motion generation yields motion that is both semantically grounded and kinematically plausible.

Abstract

Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical "reasoning-generation entanglement" challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework, EgoMotion. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.
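The two-stage structure described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the codebook, the function names `reason_to_primitives` and `generate_latents`, and the hand-written update rule are all hypothetical stand-ins. Stage 1 is approximated by nearest-neighbor lookup into a tiny codebook of primitives (in place of a VLM), and Stage 2 by a fixed iterative refinement from Gaussian noise toward the conditioning signal (in place of a learned diffusion denoiser).

```python
import math
import random

# Hypothetical codebook: primitive id -> 2-D embedding.
# The labels in the comments are illustrative only.
CODEBOOK = {
    0: [1.0, 0.0],   # e.g. "walk forward"
    1: [0.0, 1.0],   # e.g. "turn left"
    2: [-1.0, 0.0],  # e.g. "step back"
}


def reason_to_primitives(fused_features):
    """Stage 1 stand-in: map each fused vision-language feature
    to the id of its nearest primitive in the codebook."""
    ids = []
    for feat in fused_features:
        nearest = min(CODEBOOK, key=lambda k: math.dist(feat, CODEBOOK[k]))
        ids.append(nearest)
    return ids


def generate_latents(primitive_ids, steps=50, seed=0):
    """Stage 2 stand-in: start from Gaussian noise and refine it
    step by step, conditioned on the primitive embeddings.
    A real diffusion model would use a learned denoiser here."""
    rng = random.Random(seed)
    cond = [CODEBOOK[i] for i in primitive_ids]
    # One noisy latent vector per primitive in the sequence.
    latents = [[rng.gauss(0.0, 1.0) for _ in c] for c in cond]
    for t in range(steps, 0, -1):
        # Injected noise shrinks linearly and is zero at the final step.
        noise_scale = 0.1 * (t - 1) / steps
        latents = [
            [x + 0.2 * (c - x) + noise_scale * rng.gauss(0.0, 1.0)
             for x, c in zip(x_row, c_row)]
            for x_row, c_row in zip(latents, cond)
        ]
    return latents


if __name__ == "__main__":
    # Made-up fused features standing in for VLM outputs.
    features = [[0.9, 0.1], [0.1, 0.8], [-0.7, 0.2]]
    ids = reason_to_primitives(features)
    print(ids, generate_latents(ids))
```

The point of the sketch is the interface between the stages: the generator only sees discrete primitive ids, so the reasoning step and the synthesis step can be trained and inspected separately, which is the decoupling the abstract argues for.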

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 71

Year: 2026

Venue: N/A
