The paper introduces LMNav, a landmark-centric framework for Vision-and-Language Navigation in Continuous Environments (VLN-CE) that extracts semantic landmarks from language instructions and constructs a directed graph to guide navigation. LMNav enhances visual understanding through view expansion and a semantic-depth fusion module that combines semantic segmentation and depth information via SE attention. Experiments on the R2R-CE benchmark demonstrate that LMNav achieves a success rate of 51.6% and an SPL of 43.0%, outperforming baselines by leveraging structured landmark reasoning and cross-modal integration.
Achieve human-like navigation by representing instructions as landmark graphs and fusing semantic and depth information to improve visual understanding.
This paper presents LMNav, a landmark-centric navigation framework for Vision-and-Language Navigation in Continuous Environments (VLN-CE). Inspired by the way humans navigate using landmarks and their spatial relationships, LMNav explicitly extracts semantic landmarks and their temporal order from natural language instructions via a language model, constructing a structured directed graph. This graph guides both the agent's reasoning and perception processes. To enhance visual understanding, LMNav employs view expansion and a semantic-depth fusion module that integrates semantic segmentation and depth cues using an SE attention mechanism. Landmark features are further modulated by a distance-based dynamic masking mechanism and then fused with RGB embeddings via a cross-attention module. These multimodal representations are injected into the policy network to improve decision-making. We evaluate LMNav on the R2R-CE benchmark using standard VLN metrics. The proposed model achieves a success rate of 51.6% and an SPL of 43.0%, while also reducing trajectory length compared to baselines. The results demonstrate that LMNav enables more accurate and efficient navigation by leveraging structured landmark reasoning and cross-modal integration.
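The landmark-graph idea above can be sketched in a few lines: pull landmarks out of an instruction in the order they are mentioned, then link consecutive landmarks with directed edges encoding their temporal order. This is a minimal illustration, not the paper's implementation; a toy keyword matcher (`LANDMARK_VOCAB`) stands in for the language model, and the graph representation is an assumed adjacency dict.

```python
# Toy stand-in for the language model's landmark extractor (assumption).
LANDMARK_VOCAB = {"stairs", "couch", "table", "doorway", "kitchen"}

def extract_landmarks(instruction: str) -> list:
    """Return landmarks in the order they appear in the instruction."""
    tokens = instruction.lower().replace(",", " ").replace(".", " ").split()
    return [t for t in tokens if t in LANDMARK_VOCAB]

def build_landmark_graph(landmarks: list) -> dict:
    """Directed graph: each landmark points to the landmark visited next."""
    graph = {lm: [] for lm in landmarks}
    for src, dst in zip(landmarks, landmarks[1:]):
        graph[src].append(dst)
    return graph

instruction = "Walk past the couch, turn at the table, and stop near the doorway."
order = extract_landmarks(instruction)   # ['couch', 'table', 'doorway']
graph = build_landmark_graph(order)      # {'couch': ['table'], 'table': ['doorway'], 'doorway': []}
```

Such a graph gives the agent an explicit ordering over landmarks that both the reasoning and perception modules can be conditioned on, rather than treating the instruction as a flat token sequence.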
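The semantic-depth fusion step could look roughly like the following sketch, assuming PyTorch: semantic-segmentation and depth feature maps are concatenated along the channel dimension, and a squeeze-and-excitation (SE) block reweights the fused channels. The channel counts and the concatenation-based fusion are illustrative assumptions; the paper's exact module may differ.

```python
import torch
import torch.nn as nn

class SemanticDepthFusion(nn.Module):
    """Illustrative SE-attention fusion of semantic and depth features."""

    def __init__(self, sem_ch: int, depth_ch: int, reduction: int = 4):
        super().__init__()
        ch = sem_ch + depth_ch
        # Squeeze: global average pool to per-channel statistics.
        # Excitation: bottleneck MLP (as 1x1 convs) + sigmoid gate.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, sem: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([sem, depth], dim=1)   # fuse along channels
        return x * self.se(x)                # channel-wise reweighting

fusion = SemanticDepthFusion(sem_ch=32, depth_ch=16)
sem = torch.randn(2, 32, 24, 24)    # semantic-segmentation features (assumed shapes)
depth = torch.randn(2, 16, 24, 24)  # depth features
out = fusion(sem, depth)            # shape: (2, 48, 24, 24)
```

The SE gate lets the model emphasize whichever cue (semantic class evidence or depth structure) is more informative for the current view, before the fused features are combined with RGB embeddings downstream.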