This paper addresses the 3D inconsistency of frames generated by video diffusion models, which hinders the reconstruction of 3D worlds. The authors introduce a method that non-rigidly aligns video frames into a globally consistent coordinate frame using an iterative frame-to-model ICP followed by a global optimization. The aligned pointcloud then initializes a 3D reconstruction trained with a novel inverse deformation rendering loss, yielding higher-quality, explorable 3D environments.
Turn inconsistent video diffusion models into surprisingly coherent 3D world generators with a novel alignment and rendering approach.
Video diffusion models generate high-quality and diverse worlds; however, individual frames often lack 3D consistency across the output sequence, which makes the reconstruction of 3D worlds difficult. To this end, we propose a new method that handles these inconsistencies by non-rigidly aligning the video frames into a globally consistent coordinate frame that produces sharp and detailed pointcloud reconstructions. First, a geometric foundation model lifts each frame into a pixel-wise 3D pointcloud, which contains unaligned surfaces due to these inconsistencies. We then propose a tailored non-rigid iterative frame-to-model ICP to obtain an initial alignment across all frames, followed by a global optimization that further sharpens the pointcloud. Finally, we leverage this pointcloud as initialization for 3D reconstruction and propose a novel inverse deformation rendering loss to create high-quality, explorable 3D environments from inconsistent views. We demonstrate that our 3D scenes achieve higher quality than baselines, effectively turning video models into 3D-consistent world generators.
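To make the frame-to-model idea concrete, the sketch below shows a deliberately simplified stand-in: classical rigid point-to-point ICP, where each new frame is aligned to a growing model pointcloud and then merged into it. It is not the non-rigid ICP, global optimization, or rendering loss proposed here. The function names (`kabsch`, `nearest_neighbors`, `icp_rigid`, `frame_to_model`), the brute-force nearest-neighbor search, and the fixed iteration count are all assumptions chosen for illustration.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst, both (N, 3)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def nearest_neighbors(src, model):
    """Index of the closest model point for each src point (brute force)."""
    d2 = ((src[:, None, :] - model[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def icp_rigid(src, model, iters=20):
    """Rigid point-to-point ICP: returns src moved toward model."""
    cur = src.copy()
    for _ in range(iters):
        idx = nearest_neighbors(cur, model)      # correspondence step
        R, t = kabsch(cur, model[idx])           # alignment step
        cur = cur @ R.T + t
    return cur


def frame_to_model(frames, iters=20):
    """Align each frame to the accumulated model, then merge it in.

    Assumption for illustration only: frames is a list of (N_i, 3) arrays,
    and the first frame defines the global coordinate frame.
    """
    model = frames[0].copy()
    for frame in frames[1:]:
        aligned = icp_rigid(frame, model, iters)
        model = np.vstack([model, aligned])
    return model
```

A rigid transform per frame cannot remove the surface misalignments described above, since generated frames disagree with each other in ways no single rotation and translation explains; that limitation is what motivates a non-rigid alignment. The sketch only illustrates the frame-to-model loop structure.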