BITNWU, Apr 15, 2026, arXiv:2604.13654

Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

Hanxuan Chen, Jie Zheng, Siqi Yang, Tianle Zeng, Siwei Feng, Songsheng Cheng, Ruilong Ren, Shuai Yuan, Xiangyue Wang, Kangli Wang, Ji Pei, Jianfeng Pei

AI Summary

This survey paper provides a structured overview of Vision-and-Language Navigation for UAVs (UAV-VLN), tracing its evolution from modular and deep learning approaches to agentic systems powered by VLMs and generative world models. It reviews essential resources like simulators, datasets, and metrics, while also critically analyzing challenges such as the sim-to-real gap and linguistic ambiguity. The paper concludes by proposing a research roadmap focused on multi-agent swarm coordination and air-ground collaborative robotics.

Key Contribution

UAV-VLN is still far from real-world deployment due to challenges like the sim-to-real gap and linguistic ambiguity, highlighting opportunities for research in areas like multi-agent coordination.

Abstract

Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically grounded reasoning. The survey systematically reviews the ecosystem of essential resources (simulators, datasets, and evaluation metrics) that facilitates standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and the efficient deployment of large models on resource-constrained hardware. By synthesizing current benchmarks and limitations, this survey concludes by proposing a forward-looking research roadmap to guide future inquiry into key frontiers such as multi-agent swarm coordination and air-ground collaborative robotics.

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
