DFKI | Mar 4, 2026 | arXiv:2603.03935

DISC: Dense Integrated Semantic Context for Large-Scale Open-Set Semantic Mapping

Felix Igelbrink, Lennart Niecksch, Martin Atzmueller, Joachim Hertzberg

AI Summary

The paper introduces DISC, a novel approach for open-set semantic mapping that addresses the limitations of instance-centric methods by using a single-pass, distance-weighted extraction mechanism to derive CLIP embeddings directly from the vision transformer's intermediate layers. This eliminates the need for image cropping and enables pure, mask-aligned semantic representations. DISC is built on a fully GPU-accelerated architecture for on-the-fly voxel-level instance refinement, and evaluations on Replica, ScanNet, and HM3DSEM datasets demonstrate significant improvements in semantic accuracy and query retrieval compared to state-of-the-art zero-shot methods.

Key Contribution

Ditch slow, context-deprived image crops: DISC extracts high-fidelity CLIP embeddings directly from vision transformer layers for faster, more accurate open-set semantic mapping.

Abstract

Open-set semantic mapping enables language-driven robotic perception, but current instance-centric approaches are bottlenecked by context-depriving and computationally expensive crop-based feature extraction. To overcome this fundamental limitation, we introduce DISC (Dense Integrated Semantic Context), featuring a novel single-pass, distance-weighted extraction mechanism. By deriving high-fidelity CLIP embeddings directly from the vision transformer's intermediate layers, our approach eliminates the latency and domain-shift artifacts of traditional image cropping, yielding pure, mask-aligned semantic representations. To fully leverage these features in large-scale continuous mapping, DISC is built upon a fully GPU-accelerated architecture that replaces periodic offline processing with precise, on-the-fly voxel-level instance refinement. We evaluate our approach on standard benchmarks (Replica, ScanNet) and a newly generated large-scale-mapping dataset based on Habitat-Matterport 3D (HM3DSEM) to assess scalability across complex scenes in multi-story buildings. Extensive evaluations demonstrate that DISC significantly surpasses current state-of-the-art zero-shot methods in both semantic accuracy and query retrieval, providing a robust, real-time capable framework for robotic deployment. The full source code, data generation and evaluation pipelines will be made available at https://github.com/DFKI-NI/DISC.
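To make the idea of distance-weighted, mask-aligned pooling more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not state the weighting function, which transformer layers are used, or the tensor layout. The sketch assumes a dense grid of patch features of shape (H, W, D) is already available and that each patch inside an instance mask is weighted by its distance to the nearest patch outside the mask, so that interior patches count more than boundary patches. The names `distance_weights` and `pool_mask_embedding` are hypothetical.

```python
import numpy as np


def distance_weights(mask: np.ndarray) -> np.ndarray:
    """Return per-patch weights that sum to 1 over the mask.

    Assumption (not from the paper): each in-mask patch is weighted by its
    Euclidean distance to the nearest out-of-mask patch.
    """
    weights = np.zeros(mask.shape, dtype=np.float64)
    inside = np.argwhere(mask)
    outside = np.argwhere(~mask)
    if inside.size == 0:
        return weights  # empty mask: all-zero weights
    if outside.size == 0:
        weights[mask] = 1.0  # mask covers the whole grid: uniform weights
        return weights / weights.sum()
    # Brute-force pairwise distances, fine for small patch grids.
    diff = inside[:, None, :] - outside[None, :, :]
    dist = np.linalg.norm(diff, axis=-1).min(axis=1)
    weights[mask] = dist
    return weights / weights.sum()


def pool_mask_embedding(patch_feats: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Pool (H, W, D) patch features into one L2-normalized (D,) embedding."""
    w = distance_weights(mask)
    emb = (patch_feats * w[..., None]).sum(axis=(0, 1))
    norm = np.linalg.norm(emb)
    return emb / norm if norm > 0 else emb


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 8, 16))   # stand-in for dense patch features
    mask = np.zeros((8, 8), dtype=bool)
    mask[2:6, 3:7] = True                 # stand-in for one instance mask
    print(pool_mask_embedding(feats, mask).shape)
```

Because the features for the whole image are computed once, every instance mask in a frame can reuse the same grid, which is the property the abstract contrasts with running the encoder separately on each image crop.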

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
