The paper introduces a human-centric video world model for extended reality (XR) that generates egocentric virtual environments conditioned on tracked head pose and joint-level hand poses, enabling more natural embodied interaction. It evaluates existing diffusion transformer conditioning strategies and proposes a mechanism for 3D head and hand control that supports dexterous hand-object interactions. Human-subject evaluations of the generated reality system show improved task performance and a higher perceived sense of control compared to baselines.
XR gets real: control virtual worlds with your head and hands, not just text prompts.
Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived control over the performed actions compared with relevant baselines.