May 5, 2026 · arXiv:2605.03669

FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers

Timon Homberger, F. Busch, Jesús Gerardo Ortega Peimbert, Quantao Yang, Olov Andersson

AI Summary

FUS3DMaps is an online dual-layer semantic mapping method that maintains dense and instance-level open-vocabulary layers within a shared voxel map, so that previously unseen concepts can be grounded spatially. It fuses the embeddings of the two layers at the voxel level, combining the strengths of dense projection and instance segmentation. Restricting the dense layer and the cross-layer fusion to a spatial sliding window keeps the instance-level map scalable. Experiments on established 3D semantic segmentation benchmarks and a selection of large-scale scenes show accurate open-vocabulary semantic mapping at the scale of multi-story buildings.

Key Contribution

Achieves scalable open-vocabulary semantic maps of entire buildings by fusing dense and instance-level semantic information in a dual-layer voxel representation.

Abstract

Open-vocabulary semantic mapping enables robots to spatially ground previously unseen concepts without requiring predefined class sets. Current training-free methods commonly rely on multi-view fusion of semantic embeddings into a 3D map, either at the instance-level via segmenting views and encoding image crops of segments, or by projecting image patch embeddings directly into a dense semantic map. The latter approach sidesteps segmentation and 2D-to-3D instance association by operating on full uncropped image frames, but existing methods remain limited in scalability. We present FUS3DMaps, an online dual-layer semantic mapping method that jointly maintains both dense and instance-level open-vocabulary layers within a shared voxel map. This design enables further voxel-level semantic fusion of the layer embeddings, combining the complementary strengths of both semantic mapping approaches. We find that our proposed semantic cross-layer fusion approach improves the quality of both the instance-level and dense layers, while also enabling a scalable and highly accurate instance-level map where the dense layer and cross-layer fusion are restricted to a spatial sliding window. Experiments on established 3D semantic segmentation benchmarks as well as a selection of large-scale scenes show that FUS3DMaps achieves accurate open-vocabulary semantic mapping at multi-story building scales. Additional material and code will be made available: https://githanonymous.github.io/FUS3DMaps/.
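The idea of fusing two semantic layers inside a shared voxel map can be illustrated with a minimal sketch. It is not the authors' implementation: the running-mean multi-view update, the convex blend with weight `alpha`, and all names (`VoxelLayer`, `fuse_layers`, `query`) are assumptions made for illustration, and the spatial sliding window described in the abstract is omitted.

```python
import math

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

class VoxelLayer:
    """One semantic layer: voxel index -> running mean of observed embeddings."""

    def __init__(self):
        self.mean = {}
        self.count = {}

    def integrate(self, voxel, embedding):
        """Multi-view fusion: fold one new observation into the voxel's mean."""
        e = normalize(embedding)
        n = self.count.get(voxel, 0)
        old = self.mean.get(voxel, [0.0] * len(e))
        self.mean[voxel] = [(o * n + x) / (n + 1) for o, x in zip(old, e)]
        self.count[voxel] = n + 1

def fuse_layers(dense, instance, alpha=0.5):
    """Hypothetical cross-layer fusion: per-voxel convex blend of both layers.

    Voxels observed by only one layer keep that layer's embedding.
    """
    fused = {}
    for voxel in set(dense.mean) | set(instance.mean):
        d = dense.mean.get(voxel)
        i = instance.mean.get(voxel)
        if d is None:
            fused[voxel] = normalize(i)
        elif i is None:
            fused[voxel] = normalize(d)
        else:
            d, i = normalize(d), normalize(i)
            fused[voxel] = normalize(
                [alpha * x + (1 - alpha) * y for x, y in zip(d, i)]
            )
    return fused

def query(fused, text_embedding):
    """Open-vocabulary query: similarity of each voxel to a text embedding."""
    return {voxel: cosine(e, text_embedding) for voxel, e in fused.items()}
```

In this sketch a voxel seen by both layers receives a blended embedding, so a text query scores it between what either layer alone would give. In practice the embeddings would come from a vision-language encoder, and the blend weight would be a design choice rather than a fixed 0.5.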

Computer Vision · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 24

Year: 2026

Venue: N/A
