Apr 6, 2026 · arXiv:2604.04406

3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

Ze-Xin Yin, Liu Liu, Xinjie Wang, Zhizhong Su

AI Summary

3D-Fixer is introduced as a novel in-place completion paradigm for compositional 3D scene generation from a single image, addressing limitations of existing feed-forward and per-instance methods. It leverages fragmented geometry from geometry estimation as a spatial anchor, enabling the generation of complete 3D assets conditioned on partially visible point clouds without explicit pose alignment. A coarse-to-fine generation scheme, supported by a dual-branch conditioning network and an Occlusion-Robust Feature Alignment (ORFA) strategy, resolves boundary ambiguity under occlusion, achieving state-of-the-art geometric accuracy compared to baselines.

Key Contribution

3D-Fixer removes the need for explicit pose alignment in compositional 3D scene generation: it uses the fragmented geometry from geometry estimation as a spatial anchor, improving geometric accuracy while retaining the efficiency of the diffusion process.

Abstract

Compositional 3D scene generation from a single view requires the simultaneous recovery of scene layout and 3D assets. Existing approaches mainly fall into two categories: feed-forward generation methods and per-instance generation methods. The former directly predict 3D assets with explicit 6DoF poses through efficient network inference, but they generalize poorly to complex scenes. The latter improve generalization through a divide-and-conquer strategy, but suffer from time-consuming pose optimization. To bridge this gap, we introduce 3D-Fixer, a novel in-place completion paradigm. Specifically, 3D-Fixer extends 3D object generative priors to generate complete 3D assets conditioned on partially visible point clouds at their original locations, which are cropped from the fragmented geometry produced by geometry estimation methods. Unlike prior works that require explicit pose alignment, 3D-Fixer uses the fragmented geometry as a spatial anchor to preserve layout fidelity. At its core is a coarse-to-fine generation scheme that resolves boundary ambiguity under occlusion, supported by a dual-branch conditioning network and an Occlusion-Robust Feature Alignment (ORFA) strategy for stable training. Furthermore, to address the data scarcity bottleneck, we present ARSG-110K, the largest scene-level dataset to date, comprising over 110K diverse scenes and 3M annotated images with high-fidelity 3D ground truth. Extensive experiments show that 3D-Fixer achieves state-of-the-art geometric accuracy, significantly outperforming baselines such as MIDI and Gen3DSR while maintaining the efficiency of the diffusion process. Code and data will be publicly available at https://zx-yin.github.io/3dfixer.
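
The conditioning input described in the abstract, a partially visible point cloud kept at its original location, can be illustrated with a generic back-projection step. The following is a minimal sketch and not the authors' implementation: it assumes a pinhole camera, a depth map from some geometry estimator, and a per-object visibility mask. The function name `unproject_masked_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical names chosen for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def unproject_masked_depth(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project the masked pixels of a depth map into camera-frame 3D points.

    Assumption: a pinhole camera model. Pixels with non-positive depth are
    treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if keep and z > 0:
                # Standard pinhole unprojection; the points stay in the camera
                # frame, so no per-object pose is estimated or applied.
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points


# Toy 2x3 depth map and a mask selecting the visible pixels of one object.
depth = [[2.0, 2.0, 0.0],
         [2.0, 4.0, 4.0]]
mask = [[True, False, True],
        [False, True, True]]

partial_cloud = unproject_masked_depth(depth, mask, fx=2.0, fy=2.0, cx=1.0, cy=0.5)
print(len(partial_cloud))  # number of valid, visible points for this object
```

Because the cropped points are never moved out of the scene's coordinate frame, anything generated on top of them inherits the estimated layout, which is the sense in which the abstract calls the fragmented geometry a spatial anchor.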

Computer Vision · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
