AwareVLN is introduced to improve Vision-Language Navigation by incorporating a self-aware reasoning mechanism that understands the agent's state and task progress. This is achieved through a structural reasoning module for spatial and task-oriented self-awareness and an automatic data engine with progress division for training. Experiments on Habitat demonstrate that AwareVLN significantly outperforms existing VLN methods.
VLN agents can now navigate more effectively by reasoning about their own state and task progress, closing the gap between end-to-end VLMs and explicit scene mapping.
Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in the Habitat simulator show that AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.