Extend3D is a training-free pipeline for generating town-scale 3D scenes from a single image by extending the latent space of an object-centric 3D generative model. The method divides the extended latent space into overlapping patches, applies the 3D generative model to each patch, and couples the patches at every denoising step, using a point-cloud prior from monocular depth estimation for initialization and SDEdit for refinement. The key ideas are "under-noising," which treats the incompleteness of the 3D structure as noise during refinement, and 3D-aware optimization of the extended latent space to improve geometric structure and texture fidelity.
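A minimal sketch of the patch coupling, assuming a PyTorch denoiser: the extended latent is split into overlapping fixed-size patches, each patch is denoised by the object-centric model, and overlapping predictions are averaged so neighboring patches agree at every time step. `denoise_step`, `patch`, and `stride` are illustrative assumptions, not the authors' interface.

```python
import torch

def coupled_denoise_step(latent, t, denoise_step, patch=64, stride=32):
    """One coupling step over an extended [B, C, H, W] latent.

    Sketch only: `denoise_step(crop, t)` stands in for the fixed-size
    object-centric denoiser. Assumes (H - patch) and (W - patch) are
    multiples of `stride` so the patch grid covers the whole latent.
    """
    out = torch.zeros_like(latent)
    weight = torch.zeros_like(latent)
    _, _, H, W = latent.shape
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            # Denoise each overlapping patch independently...
            pred = denoise_step(latent[:, :, y:y+patch, x:x+patch], t)
            # ...then accumulate and average where patches overlap,
            # which couples neighboring patches at this time step.
            out[:, :, y:y+patch, x:x+patch] += pred
            weight[:, :, y:y+patch, x:x+patch] += 1.0
    return out / weight.clamp(min=1.0)
```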
Forget training data – Extend3D generates impressive town-scale 3D scenes from a single image by cleverly extending and patching the latent space of an object-centric 3D generative model.
In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the fixed-size latent spaces of object-centric models, which are too small to represent wide scenes, we extend the latent space in the $x$ and $y$ directions. We then divide the extended latent space into overlapping patches, apply the object-centric 3D generative model to each patch, and couple the patches at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene with a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We find that treating the incompleteness of the 3D structure as noise during refinement enables 3D completion, a mechanism we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising so that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives that improve geometric structure and texture fidelity. We demonstrate that our method outperforms prior methods in both human preference studies and quantitative experiments.
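The under-noising refinement can be sketched in a few lines: a standard SDEdit pass would re-noise the initialization to the level of an intermediate step before denoising, whereas here the incomplete geometry is itself treated as part of the corruption, so less explicit Gaussian noise is injected. `denoiser`, `sigmas`, and `noise_scale` below are hypothetical stand-ins, not the authors' implementation.

```python
import torch

def under_noised_sdedit(latent_init, denoiser, sigmas, start_idx, noise_scale=0.5):
    """SDEdit-style refinement with under-noising (illustrative sketch).

    `latent_init` is the incomplete scene latent built from the
    monocular-depth point cloud. With noise_scale < 1.0, less Gaussian
    noise is added than the schedule value sigmas[start_idx] implies, on
    the assumption that missing 3D structure already acts as noise.
    """
    x = latent_init + noise_scale * sigmas[start_idx] * torch.randn_like(latent_init)
    for i in range(start_idx, len(sigmas) - 1):
        # One reverse-diffusion step from noise level sigmas[i] to
        # sigmas[i + 1]; the denoiser's signature is assumed here.
        x = denoiser(x, sigmas[i], sigmas[i + 1])
    return x
```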