CMMR-VLN is introduced to address the limitations of LLM-based Vision-and-Language Navigation (VLN) agents in long-horizon and unfamiliar scenarios by providing them with structured memory and reflection capabilities. The framework constructs a multimodal experience memory indexed by panoramic visual images and salient landmarks, enabling the retrieval of relevant experiences during navigation. Experiments demonstrate significant improvements in success rate compared to NavGPT, MapGPT, and DiscussNav in both simulated and real-world tests, highlighting CMMR-VLN's potential as a backbone VLN framework.
LLM-based navigation agents can improve their success rates by up to 200% in real-world tests by selectively recalling and using relevant past experiences.
Although large language models (LLMs) have been introduced into vision-and-language navigation (VLN) to improve instruction comprehension and generalization, existing LLM-based VLN agents lack the ability to selectively recall and use relevant prior experience, which limits their performance in long-horizon and unfamiliar scenarios. In this work, we propose CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, CMMR-VLN constructs a multimodal experience memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieval-augmented generation pipeline to mimic how experienced human navigators leverage prior knowledge, and incorporates a reflection-based memory update strategy that selectively stores complete successful paths and the key initial mistake in failure cases. Comprehensive tests show average success rate improvements of 52.9%, 20.9%, and 20.9% in simulation and 200%, 50%, and 50% in real-world tests over NavGPT, MapGPT, and DiscussNav, respectively, demonstrating the great potential of CMMR-VLN as a backbone VLN framework.
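The memory design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that panoramic images have already been reduced to small feature vectors, that retrieval mixes cosine similarity with landmark overlap through a hypothetical weight `alpha`, and that the index of the key initial mistake is supplied by the caller (in the paper it would come from the reflection step). All names (`Experience`, `ExperienceMemory`, `retrieve`, `reflect_and_update`) are invented for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Experience:
    embedding: list       # stand-in for a panoramic image feature (assumption)
    landmarks: frozenset  # salient landmarks visible at that viewpoint
    note: str             # text that would be passed to the LLM as context


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Overlap between two landmark sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class ExperienceMemory:
    items: list = field(default_factory=list)
    alpha: float = 0.5  # hypothetical visual-vs-landmark weight

    def retrieve(self, embedding, landmarks, k=1):
        """Return the k stored experiences most similar to the current view."""
        landmarks = frozenset(landmarks)

        def score(exp):
            visual = cosine(embedding, exp.embedding)
            semantic = jaccard(landmarks, exp.landmarks)
            return self.alpha * visual + (1 - self.alpha) * semantic

        return sorted(self.items, key=score, reverse=True)[:k]

    def reflect_and_update(self, steps, success, first_mistake=None):
        """Store every step of a successful path, or only the step at index
        `first_mistake` of a failed one. Each step is a tuple of
        (embedding, landmarks, action)."""
        if success:
            kept = [(s, f"success: took '{s[2]}'") for s in steps]
        elif first_mistake is not None:
            s = steps[first_mistake]
            kept = [(s, f"failure: avoid '{s[2]}'")]
        else:
            kept = []
        for (emb, lms, _action), note in kept:
            self.items.append(Experience(list(emb), frozenset(lms), note))
```

In a retrieval-augmented loop, the `note` fields returned by `retrieve` would be added to the LLM prompt before the agent chooses its next action. Storing only the first mistake of a failed episode keeps the memory small and avoids recording the later errors that follow from it.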