OneWorld is a 3D scene generation framework that performs diffusion directly in a unified 3D representation space to improve cross-view consistency. Its core is a 3D Unified Representation Autoencoder (3D-URAE) that builds on pretrained 3D foundation models, injecting appearance and distilling semantics into a unified 3D latent space. A token-level Cross-View-Correspondence (CVC) consistency loss enforces structural alignment across views, while Manifold-Drift Forcing (MDF) mitigates train-inference exposure bias; together these lead to improved 3D scene generation.
Generating 3D scenes with diffusion models just got a whole lot more consistent across views, thanks to a new 3D-native approach that skips the 2D latent space bottleneck.
Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE), which leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce a token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.