Mar 3, 2026arXiv:2603.03024

MA-CoNav: A Master-Slave Multi-Agent Framework with Hierarchical Collaboration and Dual-Level Reflection for Long-Horizon Embodied VLN

Lingwei Luo, Ling Luo, Qi Bai, Qianqian Bai

AI Summary

The paper introduces MA-CoNav, a multi-agent framework for Vision-Language Navigation (VLN) that uses a master-slave architecture to distribute perception, planning, execution, and memory across specialized agents. This framework incorporates a dual-stage reflection mechanism to optimize the navigation pipeline dynamically. Experiments on a real-world indoor dataset show that MA-CoNav outperforms existing VLN methods, demonstrating improved performance in long-horizon navigation tasks without scene-specific fine-tuning.

Key Contribution

Overcome cognitive overload in long-horizon VLN with a master-slave multi-agent architecture that distributes tasks and dynamically optimizes via dual-level reflection.

Abstract

Vision-Language Navigation (VLN) aims to empower robots with the ability to perform long-horizon navigation in unfamiliar environments based on complex linguistic instructions. Its success critically hinges on establishing an efficient ``language-understanding -- visual-perception -- embodied-execution''closed loop. Existing methods often suffer from perceptual distortion and decision drift in complex, long-distance tasks due to the cognitive overload of a single agent. Inspired by distributed cognition theory, this paper proposes MA-CoNav, a Multi-Agent Collaborative Navigation framework. This framework adopts a ``Master-Slave''hierarchical agent collaboration architecture, decoupling and distributing the perception, planning, execution, and memory functions required for navigation tasks to specialized agents. Specifically, the Master Agent is responsible for global orchestration, while the Subordinate Agent group collaborates through a clear division of labor: an Observation Agent generates environment descriptions, a Planning Agent performs task decomposition and dynamic verification, an Execution Agent handles simultaneous mapping and action, and a Memory Agent manages structured experiences. Furthermore, the framework introduces a ``Local-Global''dual-stage reflection mechanism to dynamically optimize the entire navigation pipeline. Empirical experiments were conducted using a real-world indoor dataset collected by a Limo Pro robot, with no scene-specific fine-tuning performed on the models throughout the process. The results demonstrate that MA-CoNav comprehensively outperforms existing mainstream VLN methods across multiple metrics.

Multimodal Models Robotics & Embodied AI Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References30

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

MA-CoNav: A Master-Slave Multi-Agent Framework with Hierarchical Collaboration and Dual-Level Reflection for Long-Horizon Embodied VLN

Related Papers