Feb 5, 2026arXiv:2602.05827

Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li

AI Summary

This paper introduces SparseVideoNav, a novel approach to Beyond-the-View Navigation (BVN) that leverages video generation models to enable agents to navigate towards distant, unseen targets using only high-level language intents. The key insight is that video generation models inherently benefit from long-horizon supervision, making them suitable for BVN tasks where LLMs struggle due to short-sighted behaviors. To address the latency issues of video generation, SparseVideoNav generates sparse future trajectories, achieving a 27x speed-up and demonstrating a 2.5x improvement in success rate compared to LLM baselines in real-world zero-shot experiments, including challenging night scenes.

Key Contribution

Forget LLMs' short-sightedness: video generation offers a surprisingly effective path to real-world navigation from high-level goals, achieving 2.5x better success.

Abstract

Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal for navigation in the real-world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple and high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense and step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horimzon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet, the prohibitive latency for generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.

Multimodal Models Robotics & Embodied AI Tool Use & Agents

Citation Metrics

Citations1

Influential citations0

References50

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

Related Papers