The paper introduces DebateNav, a multi-agent decision framework for zero-shot object navigation. Multiple vision-language model (VLM) experts, each assigned a unique role, engage in structured debates under the supervision of an LLM controller whenever their perceptions conflict. DebateNav also incorporates a multimodal image fusion module, a map memory, and trajectory tracking to strengthen perception and decision-making. On the HM3D dataset it achieves a 52.3% success rate and 15.5 SPL, outperforming baselines and demonstrating the effectiveness of the expert debate mechanism and the other components.
Forget one-shot planning: DebateNav's multi-VLM expert debate framework achieves a 52.3% success rate in zero-shot object navigation by having specialized VLMs argue their perspectives, arbitrated by an LLM.
Zero-shot object navigation is a highly challenging task in embodied AI, requiring an agent to interpret natural language instructions, perceive complex visual environments, and plan actions without any task-specific training. While recent approaches have introduced large language models (LLMs) as high-level planners, they often rely on static, one-shot inference and struggle with ambiguous or partially observable scenes. This paper proposes DebateNav, a novel multi-agent decision framework that integrates multiple vision-language model (VLM) experts under the supervision of a central LLM controller. Each VLM is assigned a unique expert role (e.g., object detection, risk assessment, spatial reasoning), and together they engage in structured multi-round debates when perception conflicts arise. The LLM controller performs task decomposition, memory-guided exploration, and final arbitration based on expert arguments. To enhance the perception and decision process, DebateNav incorporates a multimodal image fusion module combining RGB, depth, and segmentation inputs, as well as a map memory and trajectory tracking system that helps avoid redundant exploration and supports long-horizon planning. The system is evaluated on a subset of the HM3D dataset with approximately 5,000 tasks, achieving a $\mathbf{52.3\%}$ success rate and $\mathbf{15.5}$ SPL under strict zero-shot conditions. Extensive ablation studies confirm the effectiveness of the expert debate mechanism, multi-modal fusion, memory system, and LLM-based arbitration. The results demonstrate that DebateNav outperforms several recent baselines and establishes a new perspective on collaborative, interpretable planning for zero-shot embodied navigation.
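To make the debate-then-arbitrate flow concrete, here is a minimal sketch of the control loop the abstract describes. This is an illustration, not the authors' implementation: the VLM experts and the LLM controller are stubbed out, and all names (`ExpertOpinion`, `run_debate`, `arbitrate`, the confidence-weighted vote) are hypothetical stand-ins for the paper's components.

```python
from dataclasses import dataclass

@dataclass
class ExpertOpinion:
    role: str          # hypothetical expert role, e.g. "object_detection"
    proposal: str      # proposed next action for the agent
    confidence: float  # self-reported confidence in [0, 1]

def revise(opinions):
    # Placeholder for expert cross-examination: in the real system each
    # VLM expert would see the others' arguments and revise its stance.
    return opinions

def arbitrate(opinions):
    # Stand-in for LLM-controller arbitration: here, a simple
    # confidence-weighted vote over the proposed actions.
    scores = {}
    for o in opinions:
        scores[o.proposal] = scores.get(o.proposal, 0.0) + o.confidence
    return max(scores, key=scores.get)

def run_debate(opinions, max_rounds=3):
    """Multi-round debate: loop while experts disagree, then arbitrate."""
    for _ in range(max_rounds):
        proposals = {o.proposal for o in opinions}
        if len(proposals) == 1:   # consensus: no arbitration needed
            return opinions[0].proposal
        opinions = revise(opinions)
    return arbitrate(opinions)

experts = [
    ExpertOpinion("object_detection", "move_forward", 0.9),
    ExpertOpinion("risk_assessment", "turn_left", 0.4),
    ExpertOpinion("spatial_reasoning", "move_forward", 0.7),
]
print(run_debate(experts))  # "move_forward" wins the weighted vote
```

The key structural point the sketch captures is that debate is only invoked on conflict: a unanimous round short-circuits to the shared proposal, and arbitration is the fallback after the round budget is exhausted.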