Lancaster University | Apr 7, 2026 | arXiv:2604.05780

Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion

Yu Xue, Longjun Gao, Yuanqi Su, HaoAng Lu, Xiaoning Zhang

AI Summary

VoxSAMNet tackles monocular semantic scene completion by explicitly addressing voxel sparsity and semantic imbalance. It uses a Dummy Shortcut for Feature Refinement (DSFR) to bypass empty voxels with a shared dummy node and refines occupied voxels using deformable attention. A Foreground Modulation Strategy, combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF), mitigates overfitting and enhances class-relevant features, achieving state-of-the-art mIoU scores of 18.2% and 20.2% on SemanticKITTI and SSCBench-KITTI-360, respectively.

Key Contribution

Ignoring voxel sparsity in 3D scene completion costs you 2% mIoU, which VoxSAMNet recovers with a novel dummy node shortcut and foreground modulation strategy.

Abstract

Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions, where over 93% of voxels are empty and foreground classes are rare, poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; and (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. Extensive experiments on the public benchmarks SemanticKITTI and SSCBench-KITTI-360 demonstrate that VoxSAMNet achieves state-of-the-art performance, surpassing prior monocular and stereo baselines with mIoU scores of 18.2% and 20.2%, respectively. Our results highlight the importance of sparsity-aware and semantics-guided design for efficient and accurate 3D scene completion, offering a promising direction for future research.
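The dummy-shortcut idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `dummy_shortcut_refine` and `toy_refine`, the flat `(N, C)` voxel layout, the zero-valued dummy vector, and the toy refinement function (a stand-in for deformable attention, which is not reproduced here) are all assumptions made for illustration. The sketch shows only the routing: empty voxels receive one shared feature vector, and the refinement step is evaluated solely on occupied voxels.

```python
import numpy as np

def dummy_shortcut_refine(feats, occupied, dummy, refine_fn):
    """Route voxels through a shared dummy vector or a refinement function.

    feats:     (N, C) per-voxel features (hypothetical flat layout)
    occupied:  (N,) boolean mask, True where a voxel is predicted occupied
    dummy:     (C,) single feature vector shared by all empty voxels
    refine_fn: maps (M, C) -> (M, C); placeholder for deformable attention
    """
    # Every voxel starts as a copy of the shared dummy feature.
    out = np.broadcast_to(dummy, feats.shape).copy()
    # Only occupied voxels pay the cost of refinement.
    if occupied.any():
        out[occupied] = refine_fn(feats[occupied])
    return out

def toy_refine(x):
    # Hypothetical placeholder, NOT deformable attention: a simple residual update.
    return x + 0.5 * np.tanh(x)

rng = np.random.default_rng(0)
N, C = 1000, 8
feats = rng.normal(size=(N, C)).astype(np.float32)
# Roughly 7% of voxels occupied, mirroring the "over 93% empty" figure.
occupied = rng.random(N) < 0.07
dummy = np.zeros(C, dtype=np.float32)  # assumed fixed here; could be learned

refined = dummy_shortcut_refine(feats, occupied, dummy, toy_refine)
print(refined.shape, int(occupied.sum()), "voxels refined")
```

Under these assumptions, the refinement cost scales with the number of occupied voxels rather than with the full grid, which is the motivation the abstract gives for bypassing empty voxels.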

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
