Extend3D is a training-free pipeline for generating town-scale 3D scenes from a single image by extending the latent space of an object-centric 3D generative model. The method divides the extended latent space into overlapping patches, applies the 3D generative model to each patch, and couples the patches at every denoising step, using a point-cloud prior from monocular depth estimation for initialization and SDEdit for refinement. The key ideas are "under-noising," which treats the incompleteness of the 3D structure as noise during refinement, and 3D-aware optimization of the extended latent space to improve geometric structure and texture fidelity.
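A minimal sketch of the patch coupling, assuming a PyTorch denoiser: the extended latent is split into overlapping fixed-size patches, each patch is denoised by the object-centric model, and overlapping predictions are averaged so neighboring patches agree at every time step. `denoise_step`, `patch`, and `stride` are illustrative assumptions, not the authors' interface.

```python
import torch

def coupled_denoise_step(latent, t, denoise_step, patch=64, stride=32):
    """One coupling step over an extended [B, C, H, W] latent.

    Sketch only: `denoise_step(crop, t)` stands in for the fixed-size
    object-centric denoiser. Assumes (H - patch) and (W - patch) are
    multiples of `stride` so the patch grid covers the whole latent.
    """
    out = torch.zeros_like(latent)
    weight = torch.zeros_like(latent)
    _, _, H, W = latent.shape
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            # Denoise each overlapping patch independently...
            pred = denoise_step(latent[:, :, y:y+patch, x:x+patch], t)
            # ...then accumulate and average where patches overlap,
            # which couples neighboring patches at this time step.
            out[:, :, y:y+patch, x:x+patch] += pred
            weight[:, :, y:y+patch, x:x+patch] += 1.0
    return out / weight.clamp(min=1.0)
```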
Forget training data – Extend3D generates impressive town-scale 3D scenes from a single image by cleverly extending and patching the latent space of an object-centric 3D generative model.
In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the fixed-size latent spaces of object-centric models, which are too small to represent wide scenes, we extend the latent space in the $x$ and $y$ directions. We then divide the extended latent space into overlapping patches, apply the object-centric 3D generative model to each patch, and couple the patches at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene with a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We find that treating the incompleteness of the 3D structure as noise during refinement enables 3D completion, a mechanism we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising so that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives that improve geometric structure and texture fidelity. We demonstrate that our method outperforms prior methods in both human preference studies and quantitative experiments.
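The under-noising refinement can be sketched in a few lines: a standard SDEdit pass would re-noise the initialization to the level of an intermediate step before denoising, whereas here the incomplete geometry is itself treated as part of the corruption, so less explicit Gaussian noise is injected. `denoiser`, `sigmas`, and `noise_scale` below are hypothetical stand-ins, not the authors' implementation.

```python
import torch

def under_noised_sdedit(latent_init, denoiser, sigmas, start_idx, noise_scale=0.5):
    """SDEdit-style refinement with under-noising (illustrative sketch).

    `latent_init` is the incomplete scene latent built from the
    monocular-depth point cloud. With noise_scale < 1.0, less Gaussian
    noise is added than the schedule value sigmas[start_idx] implies, on
    the assumption that missing 3D structure already acts as noise.
    """
    x = latent_init + noise_scale * sigmas[start_idx] * torch.randn_like(latent_init)
    for i in range(start_idx, len(sigmas) - 1):
        # One reverse-diffusion step from noise level sigmas[i] to
        # sigmas[i + 1]; the denoiser's signature is assumed here.
        x = denoiser(x, sigmas[i], sigmas[i + 1])
    return x
```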