The paper introduces LMNav, a landmark-centric framework for Vision-and-Language Navigation in Continuous Environments (VLN-CE) that extracts semantic landmarks from language instructions and constructs a directed graph to guide navigation. LMNav enhances visual understanding through view expansion and a semantic-depth fusion module that combines semantic segmentation and depth information via SE attention. Experiments on the R2R-CE benchmark demonstrate that LMNav achieves a success rate of 51.6% and an SPL of 43.0%, outperforming baselines by leveraging structured landmark reasoning and cross-modal integration.
Achieve human-like navigation by representing instructions as landmark graphs and fusing semantic and depth information to improve visual understanding.
This paper presents LMNav, a landmark-centric navigation framework for Vision-and-Language Navigation in Continuous Environments (VLN-CE). Inspired by the way humans navigate using landmarks and their spatial relationships, LMNav explicitly extracts semantic landmarks and their temporal order from natural language instructions via a language model, constructing a structured directed graph. This graph guides both the agent's reasoning and perception processes. To enhance visual understanding, LMNav employs view expansion and a semantic-depth fusion module that integrates semantic segmentation and depth cues using an SE attention mechanism. Landmark features are further modulated by a distance-based dynamic masking mechanism and then fused with RGB embeddings via a cross-attention module. These multimodal representations are injected into the policy network to improve decision-making. We evaluate LMNav on the R2R-CE benchmark using standard VLN metrics. The proposed model achieves a success rate of 51.6% and an SPL of 43.0%, while also reducing trajectory length compared to baselines. The results demonstrate that LMNav enables more accurate and efficient navigation by leveraging structured landmark reasoning and cross-modal integration.
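The landmark-graph idea above can be sketched in a few lines: pull landmarks out of an instruction in the order they are mentioned, then link consecutive landmarks with directed edges encoding their temporal order. This is a minimal illustration, not the paper's implementation; a toy keyword matcher (`LANDMARK_VOCAB`) stands in for the language model, and the graph representation is an assumed adjacency dict.

```python
# Toy stand-in for the language model's landmark extractor (assumption).
LANDMARK_VOCAB = {"stairs", "couch", "table", "doorway", "kitchen"}

def extract_landmarks(instruction: str) -> list:
    """Return landmarks in the order they appear in the instruction."""
    tokens = instruction.lower().replace(",", " ").replace(".", " ").split()
    return [t for t in tokens if t in LANDMARK_VOCAB]

def build_landmark_graph(landmarks: list) -> dict:
    """Directed graph: each landmark points to the landmark visited next."""
    graph = {lm: [] for lm in landmarks}
    for src, dst in zip(landmarks, landmarks[1:]):
        graph[src].append(dst)
    return graph

instruction = "Walk past the couch, turn at the table, and stop near the doorway."
order = extract_landmarks(instruction)   # ['couch', 'table', 'doorway']
graph = build_landmark_graph(order)      # {'couch': ['table'], 'table': ['doorway'], 'doorway': []}
```

Such a graph gives the agent an explicit ordering over landmarks that both the reasoning and perception modules can be conditioned on, rather than treating the instruction as a flat token sequence.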
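The semantic-depth fusion step could look roughly like the following sketch, assuming PyTorch: semantic-segmentation and depth feature maps are concatenated along the channel dimension, and a squeeze-and-excitation (SE) block reweights the fused channels. The channel counts and the concatenation-based fusion are illustrative assumptions; the paper's exact module may differ.

```python
import torch
import torch.nn as nn

class SemanticDepthFusion(nn.Module):
    """Illustrative SE-attention fusion of semantic and depth features."""

    def __init__(self, sem_ch: int, depth_ch: int, reduction: int = 4):
        super().__init__()
        ch = sem_ch + depth_ch
        # Squeeze: global average pool to per-channel statistics.
        # Excitation: bottleneck MLP (as 1x1 convs) + sigmoid gate.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, sem: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([sem, depth], dim=1)   # fuse along channels
        return x * self.se(x)                # channel-wise reweighting

fusion = SemanticDepthFusion(sem_ch=32, depth_ch=16)
sem = torch.randn(2, 32, 24, 24)    # semantic-segmentation features (assumed shapes)
depth = torch.randn(2, 16, 24, 24)  # depth features
out = fusion(sem, depth)            # shape: (2, 48, 24, 24)
```

The SE gate lets the model emphasize whichever cue (semantic class evidence or depth structure) is more informative for the current view, before the fused features are combined with RGB embeddings downstream.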