Mar 3, 2026 · arXiv:2603.02972

TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

Jiaxing Liu, Zexi Zhang, Xiaoyan Li, Boyue Wang, Yongli Hu, Baocai Yin

AI Summary

TagaVLM addresses the architectural mismatch between VLMs and the dynamic, spatially-structured nature of Vision-Language Navigation (VLN) by explicitly injecting topological structures into the VLM backbone. It introduces Spatial Topology Aware Residual Attention (STAR-Att) to integrate topological edge information into the self-attention mechanism and uses an Interleaved Navigation Prompt to enhance node-level visual-text alignment. Experiments on the R2R benchmark demonstrate state-of-the-art performance among large-model-based methods, achieving a Success Rate of 51.09% and SPL of 47.18 in unseen environments.
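The two metrics quoted above can be made concrete with a short sketch. It assumes the standard definitions used in VLN evaluation: Success Rate is the fraction of successful episodes, and SPL (Success weighted by Path Length) scales each success by the ratio of the shortest-path length to the length of the path actually taken. The function and parameter names are illustrative and are not taken from the TagaVLM code; the values are fractions in [0, 1], whereas papers usually report them multiplied by 100.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(1 for s in successes if s) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        # Failed episodes contribute 0; the denom check avoids dividing by zero.
        if ok and denom > 0:
            total += shortest / denom
    return total / len(successes)


# Three hypothetical episodes: optimal success, detoured success, failure.
outcomes = [True, True, False]
shortest = [10.0, 10.0, 10.0]
taken = [10.0, 20.0, 5.0]
print(success_rate(outcomes), spl(outcomes, shortest, taken))
```

In this toy example the detoured success counts only half, so SPL is lower than SR. This is why a method can gain more in SPL than in SR when it takes more efficient paths.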

Key Contribution

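Injecting topological awareness into a smaller open-source VLM outperforms prior large-model-based methods on the R2R benchmark (by 3.39% in SR and 9.08 in SPL in unseen environments), suggesting that targeted architectural enhancements can be more effective than brute-force model scaling for Vision-Language Navigation.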

Abstract

Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM's self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code will be released upon publication.

Project page: https://apex-bjut.github.io/Taga-VLM
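The abstract does not spell out how STAR-Att is formulated, so the following is only a generic sketch of one way to feed graph edge information into self-attention: an additive bias on the attention logits derived from pairwise node distances. Everything here is an assumption made for illustration, not the authors' method: the function name, the `dist` matrix, and the scalar `bias_scale`. With `bias_scale = 0` the computation reduces to ordinary scaled dot-product attention, which is one simple way a residual term could leave pretrained behaviour untouched at initialization.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def topology_biased_attention(q, k, v, dist, bias_scale=0.0):
    """Scaled dot-product attention with a distance-based additive bias.

    q, k, v: lists of n vectors (one per graph node).
    dist: n x n matrix of pairwise node distances (hypothetical input).
    bias_scale: hypothetical scalar; 0 gives plain attention.
    Returns (outputs, attention_weights).
    """
    d = len(q[0])
    weights = []
    for i, qi in enumerate(q):
        logits = [
            # Content term minus a penalty that grows with graph distance.
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            - bias_scale * dist[i][j]
            for j, kj in enumerate(k)
        ]
        weights.append(softmax(logits))
    outputs = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return outputs, weights


# Three hypothetical nodes; node 2 is far from node 0 in the graph.
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dist = [[0.0, 1.0, 5.0], [1.0, 0.0, 4.0], [5.0, 4.0, 0.0]]
_, plain = topology_biased_attention(q, k, v, dist, bias_scale=0.0)
_, biased = topology_biased_attention(q, k, v, dist, bias_scale=1.0)
print(plain[0], biased[0])
```

In this sketch, raising `bias_scale` shifts attention mass from distant nodes toward nearby ones without changing the content-based term, so spatial structure enters the model as a correction on top of the existing attention pattern.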

Multimodal Models · Robotics & Embodied AI · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
