TU Munich | Mar 18, 2026 | arXiv:2603.17459

P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation

Tianfu Li, Tian Li, Wenbo Chen, Haoxuan Xu, Xinhu Zheng, Haoang Li

AI Summary

This paper introduces P$^{3}$Nav, an end-to-end framework for Vision-and-Language Navigation (VLN) that integrates perception, prediction, and planning to improve scene understanding. P$^{3}$Nav enhances perception by extracting object-level and map-level cues, predicts future waypoints to model potential agent states, and forecasts semantic map cues for proactive planning. Experiments on REVERIE, R2R-CE, and RxR-CE benchmarks demonstrate state-of-the-art performance, highlighting the benefits of integrating perception and prediction into VLN.

Key Contribution

VLN agents can navigate more effectively by predicting their future states and proactively planning based on forecasted semantic map cues, rather than relying solely on historical context.

Abstract

In Vision-and-Language Navigation (VLN), an agent is required to plan a path to the target specified by the language instruction, using its visual observations. Consequently, prevailing VLN methods primarily focus on building powerful planners through visual-textual alignment. However, these approaches often bypass the imperative of comprehensive scene understanding prior to planning, leaving the agent with insufficient perception or prediction capabilities. Thus, we propose P$^{3}$Nav, a novel end-to-end framework integrating perception, prediction, and planning in a unified pipeline to strengthen the VLN agent's scene understanding and boost navigation success. Specifically, P$^{3}$Nav augments perception by extracting complementary cues from object-level and map-level perspectives. Subsequently, our P$^{3}$Nav predicts waypoints to model the agent's potential future states, endowing the agent with intrinsic awareness of candidate positions during navigation. Conditioned on these future waypoints, P$^{3}$Nav further forecasts semantic map cues, enabling proactive planning and reducing the strict reliance on purely historical context. Integrating these perceptual and predictive cues, a holistic planning module finally carries out the VLN tasks. Extensive experiments demonstrate that our P$^{3}$Nav achieves new state-of-the-art performance on the REVERIE, R2R-CE, and RxR-CE benchmarks.
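The three stages described above can be illustrated with a toy data-flow sketch. This is not the authors' implementation: the grid world, the keyword-matching planner, and every name in it (`Percept`, `perceive`, `predict_waypoints`, `forecast_cues`, `plan`) are hypothetical stand-ins chosen only to show how perceptual cues, predicted waypoints, and waypoint-conditioned forecasts might feed a single planning step.

```python
from dataclasses import dataclass
from math import dist

Point = tuple[float, float]


@dataclass
class Percept:
    objects: dict[str, Point]  # object-level cues: label -> position (assumed format)
    occupied: set[Point]       # map-level cues: cells known to be blocked (assumed format)


def perceive(observation: dict[str, Point], blocked: set[Point]) -> Percept:
    """Perception (toy): bundle object-level and map-level cues."""
    return Percept(objects=dict(observation), occupied=set(blocked))


def predict_waypoints(position: Point, percept: Percept) -> list[Point]:
    """Prediction, step 1 (toy): candidate future positions are the free 4-neighbours."""
    x, y = position
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [c for c in candidates if c not in percept.occupied]


def forecast_cues(waypoint: Point, percept: Percept, radius: float = 2.5) -> set[str]:
    """Prediction, step 2 (toy): objects expected within `radius` of a waypoint."""
    return {name for name, pos in percept.objects.items() if dist(waypoint, pos) <= radius}


def plan(instruction: str, position: Point, percept: Percept) -> Point:
    """Planning (toy): pick the waypoint whose forecast best matches the instruction."""
    words = set(instruction.lower().split())
    waypoints = predict_waypoints(position, percept)
    return max(waypoints, key=lambda w: len(forecast_cues(w, percept) & words))


percept = perceive({"door": (0, 3), "sofa": (2, 0)}, blocked={(-1, 0)})
next_step = plan("go to the door", (0, 0), percept)
```

The point of the sketch is the ordering: the planner scores candidate future positions using cues forecast for those positions, rather than using only what has already been observed.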

Multimodal Models | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 82

Year: 2026

Venue: N/A
