Mar 2, 2026 · arXiv:2603.01477

SFCo-Nav: Efficient Zero-Shot Visual Language Navigation via Collaboration of Slow LLM and Fast Attributed Graph Alignment

Chaoran Xiong, Litao Wei, Xinhao Hu, Kehui Ma, Ziyi Xia, Zixin Jiang, Zheng Sun, Zhen Sun, Ling Pei

AI Summary

The paper introduces SFCo-Nav, a zero-shot Visual Language Navigation (VLN) framework that pairs a slow LLM for high-level planning with a fast reactive navigator for real-time execution. SFCo-Nav reduces computational cost and latency by constructing attributed graphs of the imagined and perceived environments and using an asynchronous bridge to trigger the LLM planner only when navigation confidence is low. Experiments on the R2R and REVERIE benchmarks show that SFCo-Nav achieves comparable or better success rates than existing zero-shot VLN methods while significantly reducing token consumption and improving speed. The authors also demonstrate its practicality on a legged robot.

Key Contribution

Navigate faster and cheaper: SFCo-Nav cuts VLN token consumption per trajectory by over 50% and runs more than 3.5x faster by triggering a slow LLM only when a fast reactive navigator loses confidence.

Abstract

Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to visual language navigation (VLN), where an agent follows natural language instructions using only egocentric perception and reasoning. However, existing zero-shot methods typically construct a naive observation graph and perform per-step VLM-LLM inference on it, resulting in high latency and computation costs that limit real-time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow-fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator for real-time object graph construction and subgoal execution; and 3) a lightweight asynchronous slow-fast bridge that aligns the structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first slow-fast collaborative zero-shot VLN system that supports asynchronous LLM triggering based on internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo-Nav matches or exceeds prior state-of-the-art zero-shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5 times faster. Finally, we demonstrate SFCo-Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.
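The confidence-gated triggering described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the alignment metric, so the `AttributedGraph` class, the overlap-based `alignment_confidence` score, and the `threshold` value of 0.5 are all hypothetical choices made only to show the idea of comparing an imagined graph with a perceived one and calling the slow planner when the match is poor.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedGraph:
    """Hypothetical structure: objects with attribute sets plus undirected relations."""
    nodes: dict[str, set[str]] = field(default_factory=dict)
    edges: set[frozenset[str]] = field(default_factory=set)

    def add_object(self, name: str, *attributes: str) -> None:
        self.nodes[name] = set(attributes)

    def relate(self, a: str, b: str) -> None:
        self.edges.add(frozenset((a, b)))


def alignment_confidence(imagined: AttributedGraph, perceived: AttributedGraph) -> float:
    """Assumed metric: share of the imagined graph found in the perceived graph (0 to 1)."""
    if not imagined.nodes:
        return 0.0
    node_score = 0.0
    for name, attrs in imagined.nodes.items():
        if name in perceived.nodes:
            seen = perceived.nodes[name]
            # An object with no expected attributes counts as fully matched.
            node_score += len(attrs & seen) / len(attrs) if attrs else 1.0
    node_score /= len(imagined.nodes)
    if not imagined.edges:
        return node_score
    edge_score = len(imagined.edges & perceived.edges) / len(imagined.edges)
    # Equal weighting of node and edge agreement is an arbitrary illustrative choice.
    return 0.5 * (node_score + edge_score)


def should_call_planner(confidence: float, threshold: float = 0.5) -> bool:
    """Trigger the slow LLM planner only when confidence falls below the threshold."""
    return confidence < threshold


# Imagined subgoal scene: a red sofa next to a tall lamp.
imagined = AttributedGraph()
imagined.add_object("sofa", "red")
imagined.add_object("lamp", "tall")
imagined.relate("sofa", "lamp")

# Perceived scene: the sofa is visible, the lamp is not.
perceived = AttributedGraph()
perceived.add_object("sofa", "red", "large")
perceived.add_object("table", "wooden")

confidence = alignment_confidence(imagined, perceived)
replan = should_call_planner(confidence)
```

In this toy scene only one of the two imagined objects and none of the imagined relations are observed, so the score falls below the assumed threshold and the planner would be invoked. While the score stays above the threshold, the fast navigator would keep executing the current subgoal without any LLM call, which is where the token and latency savings would come from.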

Multimodal Models · Robotics & Embodied AI · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 24

Year: 2026

Venue: N/A
