Central South University; Institute of Artificial Intelligence; JSU; SJTU
Jun 8, 2026
arXiv:2606.08992

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng, Chengnuo Sun, Xuelong Li, Hua Yang

AI Summary

This paper introduces SpaceVLN, a zero-shot navigation agent that leverages Spatial Cognitive Memory and Task-Guided Spatial Reasoning to enhance Vision-and-Language Navigation in continuous environments. By organizing planning and execution around verifiable spatial stages, SpaceVLN effectively abstracts explored regions into Spatial Waypoints and maintains dynamic evidence of landmarks, allowing for improved spatial-relation understanding and task progress localization. The agent outperforms existing methods across multiple benchmarks and demonstrates real-robot applicability, underscoring the importance of spatial reasoning in embodied navigation tasks.

Key Contribution

SpaceVLN achieves state-of-the-art zero-shot performance on R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON by integrating Spatial Cognitive Memory with Task-Guided Spatial Reasoning, without task-specific policy training.

Abstract

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space-landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.
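
To make the idea of a hierarchical memory more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a two-level structure in which coarse waypoints summarize explored regions and each waypoint stores the strongest evidence seen for nearby landmarks; a stage counts as verified once its landmark has enough evidence. All names (`SpatialWaypoint`, `SpatialMemory`, `merge_radius`, `min_conf`, `stage_verified`) and the distance-based merging rule are hypothetical choices made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import hypot


@dataclass
class SpatialWaypoint:
    """Hypothetical coarse node summarizing one explored region."""
    waypoint_id: int
    position: tuple[float, float]  # 2D position in an assumed odometry frame
    # landmark name -> highest detection confidence observed near this waypoint
    landmarks: dict[str, float] = field(default_factory=dict)


class SpatialMemory:
    """Toy two-level memory: waypoints on top, landmark evidence underneath."""

    def __init__(self, merge_radius: float = 1.5) -> None:
        # Assumed threshold: observations closer than this reuse a waypoint.
        self.merge_radius = merge_radius
        self.waypoints: list[SpatialWaypoint] = []

    def _distance(self, wp: SpatialWaypoint, position: tuple[float, float]) -> float:
        return hypot(wp.position[0] - position[0], wp.position[1] - position[1])

    def observe(self, position: tuple[float, float],
                detections: dict[str, float]) -> SpatialWaypoint:
        """Attach detections to the nearest waypoint, or open a new one."""
        nearest = min(self.waypoints,
                      key=lambda w: self._distance(w, position),
                      default=None)
        if nearest is None or self._distance(nearest, position) > self.merge_radius:
            nearest = SpatialWaypoint(len(self.waypoints), position)
            self.waypoints.append(nearest)
        for name, conf in detections.items():
            # Keep only the strongest evidence per landmark.
            nearest.landmarks[name] = max(conf, nearest.landmarks.get(name, 0.0))
        return nearest

    def locate(self, landmark: str, min_conf: float = 0.5) -> SpatialWaypoint | None:
        """Return the waypoint with the strongest evidence for a landmark."""
        candidates = [w for w in self.waypoints
                      if w.landmarks.get(landmark, 0.0) >= min_conf]
        return max(candidates, key=lambda w: w.landmarks[landmark], default=None)


def stage_verified(memory: SpatialMemory, stage_landmark: str,
                   min_conf: float = 0.5) -> bool:
    """A stage is treated as verified once its landmark has enough evidence."""
    return memory.locate(stage_landmark, min_conf) is not None


if __name__ == "__main__":
    memory = SpatialMemory()
    memory.observe((0.0, 0.0), {"sofa": 0.8})
    memory.observe((0.5, 0.5), {"sofa": 0.6, "door": 0.4})  # merged into waypoint 0
    memory.observe((4.0, 0.0), {"door": 0.9})               # opens waypoint 1
    print(len(memory.waypoints), stage_verified(memory, "door"))
```

In this sketch, querying the memory by landmark rather than replaying a linear observation history is what lets progress checks refer to places instead of time steps; the actual system described in the abstract may organize and verify stages quite differently.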

Multimodal Models Robotics & Embodied AI

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
