This paper introduces FR3D, a novel world model that predicts a consistent 3D latent representation for dynamic environments by disentangling ego-motion from environmental dynamics. By treating ego-motion as a latent proxy for action, FR3D addresses the physical inconsistencies found in prior generative models, such as morphing or vanishing objects over time. Extensive experiments show that FR3D achieves robust zero-shot generalization for future dynamic 3D reconstruction from monocular observations, maintaining geometric consistency even 2 seconds into the future.
Disentangling ego-motion from environmental dynamics lets FR3D keep its future 3D reconstructions geometrically consistent, avoiding the morphing and vanishing objects seen in image-plane world models.
Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego-motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image-based features, FR3D explicitly decouples the 3D evolution of the scene from the agent's trajectory, treating the inferred ego-motion as a latent proxy for action. This disentanglement resolves the ambiguities between self-motion and world-motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher-student distillation strategy that leverages the spatial "common sense" of off-the-shelf foundation models, leading to robust zero-shot generalization. Extensive experiments demonstrate FR3D's strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future. Project page: https://fr3d-wm.github.io.
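The ambiguity between self-motion and world-motion described in the abstract can be illustrated with a toy rigid-geometry example. The sketch below is not FR3D's architecture: it assumes the camera poses are known exactly and uses two hand-picked 3D points, and every name in it (`yaw_rotation`, `world_to_camera`, `camera_to_world`, the example coordinates) is hypothetical. It only shows why camera-frame observations mix the two kinds of motion, and why expressing the scene in a persistent world frame separates them.

```python
import numpy as np

def yaw_rotation(angle_rad):
    """Rotation about the y (up) axis; maps camera-frame vectors to world-frame vectors."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def world_to_camera(points_world, R_wc, t_wc):
    """Express world-frame points (rows) in a camera with orientation R_wc and position t_wc."""
    return (points_world - t_wc) @ R_wc

def camera_to_world(points_cam, R_wc, t_wc):
    """Inverse of world_to_camera: lift camera-frame points back into the world frame."""
    return points_cam @ R_wc.T + t_wc

# Hypothetical scene: row 0 is a static point, row 1 moves 0.5 m along x between frames.
points_t0 = np.array([[0.0, 0.0, 5.0],
                      [1.0, 0.0, 5.0]])
points_t1 = np.array([[0.0, 0.0, 5.0],
                      [1.5, 0.0, 5.0]])

# Hypothetical ego-motion: the camera starts at the origin, then drives 1 m forward
# and turns by 10 degrees.
R0, t0 = np.eye(3), np.zeros(3)
R1, t1 = yaw_rotation(np.deg2rad(10.0)), np.array([0.0, 0.0, 1.0])

# What the camera observes: ego-motion and scene motion are entangled,
# so even the static point appears to move.
cam_t0 = world_to_camera(points_t0, R0, t0)
cam_t1 = world_to_camera(points_t1, R1, t1)
apparent_motion = cam_t1 - cam_t0

# Compensating for ego-motion expresses both frames in one persistent world frame,
# where only the truly dynamic point has non-zero displacement.
world_motion = camera_to_world(cam_t1, R1, t1) - camera_to_world(cam_t0, R0, t0)

print("apparent (camera-frame) motion:\n", apparent_motion)
print("world-frame motion:\n", world_motion)
```

In this toy setting the poses are given; in the abstract's description, by contrast, the ego-motion is inferred and treated as a latent proxy for action.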