Huawei · Mar 18, 2026 · arXiv:2603.17519

UniSem: Generalizable Semantic 3D Reconstruction from Sparse Unposed Images

Guibiao Liao, Kaimin Liao, Hua Wang, Zhi Chen, Luchao Wang, Yaohua Tang

AI Summary

The paper introduces UniSem, a unified feed-forward framework for semantic-aware 3D reconstruction from sparse, unposed images using 3D Gaussian Splatting (3DGS). UniSem employs Error-aware Gaussian Dropout (EGD) to improve depth accuracy by suppressing redundancy-prone Gaussians based on rendering error. It also uses a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model's emergent 3D semantic priors via object-level prototype alignment, enhancing semantic coherence and completeness. Experiments on ScanNet and Replica show that UniSem outperforms strong baselines in both depth prediction and open-vocabulary 3D segmentation across varying numbers of input views.

Key Contribution

Improves semantic 3D reconstruction from sparse, unposed views by suppressing redundancy-prone Gaussians using rendering error cues and by blending 2D segmenter-lifted semantics with emergent 3D semantic priors.

Abstract

Semantic-aware 3D reconstruction from sparse, unposed images remains challenging for feed-forward 3D Gaussian Splatting (3DGS). Existing methods often predict an over-complete set of Gaussian primitives under sparse-view supervision, leading to unstable geometry and inferior depth quality. Meanwhile, they rely solely on 2D segmenter features for semantic lifting, which provides weak 3D-level and limited generalizable supervision, resulting in incomplete 3D semantics in novel scenes. To address these issues, we propose UniSem, a unified framework that jointly improves depth accuracy and semantic generalization via two key components. First, Error-aware Gaussian Dropout (EGD) performs error-guided capacity control by suppressing redundancy-prone Gaussians using rendering error cues, producing meaningful, geometrically stable Gaussian representations for improved depth estimation. Second, we introduce a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model's own emergent 3D semantic priors, implemented with object-level prototype alignment to enhance semantic coherence and completeness. Extensive experiments on ScanNet and Replica show that UniSem achieves superior performance in depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. Notably, with 16-view inputs, UniSem reduces depth Rel by 15.2% and improves open-vocabulary segmentation mAcc by 3.7% over strong baselines.
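The two headline numbers refer to depth Rel and segmentation mAcc. As a minimal sketch, assuming Rel is the standard mean absolute relative depth error and mAcc is the mean of per-class accuracies (the abstract does not spell out either definition), they can be computed as below. The function names and the example values are hypothetical and are not taken from the paper; the abstract also does not say whether its percentage changes are relative or absolute, so `relative_reduction` shows only the relative reading.

```python
from typing import Hashable, Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute relative depth error over pixels with positive ground truth."""
    # Pixels with non-positive ground-truth depth are treated as invalid (assumption).
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def mean_class_accuracy(pred: Sequence[Hashable], gt: Sequence[Hashable]) -> float:
    """Average of per-class accuracies, taken over classes present in the ground truth."""
    total: dict = {}
    correct: dict = {}
    for p, g in zip(pred, gt):
        total[g] = total.get(g, 0) + 1
        if p == g:
            correct[g] = correct.get(g, 0) + 1
    if not total:
        raise ValueError("no labels given")
    return sum(correct.get(c, 0) / n for c, n in total.items()) / len(total)


def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of an error metric relative to a baseline value."""
    return (baseline - ours) / baseline


# Hypothetical toy inputs, for illustration only.
toy_rel = abs_rel([1.1, 2.0, 3.0], [1.0, 2.0, 0.0])          # third pixel is invalid
toy_macc = mean_class_accuracy([0, 0, 1, 1], [0, 0, 1, 0])    # two classes: 0 and 1
toy_gain = relative_reduction(0.10, 0.0848)                   # made-up Rel values
```

Because mAcc averages over classes rather than pixels, a rare class weighs as much as a frequent one, which is one reason it is often reported for segmentation alongside pixel-level scores.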

Computer Vision · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
