ZJU | May 26, 2026 | arXiv:2605.26500

3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

AI Summary

This paper introduces a 3D Gaussian Map representation for Vision-Language Navigation (VLN) that captures both the geometry and the semantics of the environment. The map is built by initializing 3D Gaussians from sparse pseudo-lidar point clouds, and an Open-Set Semantic Grouping operation then assigns each Gaussian primitive to an object instance or a stuff category. A Multi-Level Action Prediction strategy uses this map to support navigation decisions, and experiments on the R2R, R4R, and REVERIE benchmarks validate the approach.

Key Contribution

Representing environments as semantically grouped 3D Gaussians gives a VLN agent joint geometric and semantic cues that help it navigate unseen environments.

Abstract

Vision-language navigation (VLN) requires an agent to traverse complex 3D environments based on natural language instructions, which demands a thorough understanding of the scene. While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and develops a navigation strategy for VLN on top of it. Specifically, an Egocentric Scene Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors for scene understanding. Each Gaussian primitive is further enriched through an Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world, resulting in a unified 3D Gaussian Map. Building on this map, a Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist agents in decision-making. Extensive experiments conducted on three public benchmarks (i.e., R2R, R4R, and REVERIE) validate the effectiveness of the method.
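
The map-construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera model for lifting depth to pseudo-lidar points, an isotropic scale set from the mean distance to the k nearest neighbours (a common 3D Gaussian Splatting heuristic that the abstract does not confirm), and a hypothetical `group_mask` of per-pixel integer labels standing in for the output of some open-set segmenter. All function and field names are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    mean: tuple      # 3D center (x, y, z) in the camera frame
    scale: float     # isotropic standard deviation (assumption, not from the paper)
    group_id: int    # instance/stuff label taken from a hypothetical mask


def backproject(depth, fx, fy, cx, cy, stride=1):
    """Lift a depth image (list of rows, in metres) to sparse pseudo-lidar points.

    Returns a list of ((x, y, z), (u, v)) pairs; pixels with non-positive
    depth are treated as invalid and skipped.
    """
    points = []
    for v in range(0, len(depth), stride):
        for u in range(0, len(depth[v]), stride):
            z = depth[v][u]
            if z <= 0:
                continue
            x = (u - cx) * z / fx  # pinhole model (assumed)
            y = (v - cy) * z / fy
            points.append(((x, y, z), (u, v)))
    return points


def init_gaussians(points, group_mask, k=3):
    """Create one Gaussian per point.

    The scale is the mean distance to the k nearest other points
    (brute force, O(n^2); fine for a sketch, too slow for real maps).
    """
    gaussians = []
    for i, (p, (u, v)) in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, (q, _) in enumerate(points) if j != i
        )
        nearest = dists[:k]
        scale = sum(nearest) / len(nearest) if nearest else 1.0
        gaussians.append(Gaussian(mean=p, scale=scale, group_id=group_mask[v][u]))
    return gaussians


def group_gaussians(gaussians):
    """Collect Gaussians by their group label."""
    groups = {}
    for g in gaussians:
        groups.setdefault(g.group_id, []).append(g)
    return groups
```

Calling `backproject` on a depth frame, passing the result to `init_gaussians` together with a label mask of the same size, and then calling `group_gaussians` yields a dictionary from group label to its Gaussians, which is the kind of grouped structure a downstream policy could query at the object level or the scene level.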

Computer Vision | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
