Apr 9, 2026 · arXiv:2604.07705

Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models

Xingyu Xia, Lekai Zhou, Yujie Tang, Xiaozhou Zhu, Hai Zhu, Wen Yao

AI Summary

This survey reviews the field of Aerial Vision-Language Navigation (Aerial VLN), focusing on the integration of Large Language Models (LLMs) and Vision-Language Models (VLMs). It organizes existing methods into five architectural categories and analyzes their design rationales, trade-offs, and reported performance. The paper also identifies gaps in evaluation infrastructure and outlines seven open problems to guide future research in Aerial VLN.

Key Contribution

Aerial VLN is ripe for disruption by LLMs, but current benchmarks lack the scale, diversity, and real-world grounding needed to truly evaluate progress.

Abstract

Aerial vision-and-language navigation (Aerial VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and autonomously navigate complex three-dimensional environments by grounding language in visual perception. This survey provides a critical and analytical review of the Aerial VLN field, with particular attention to the recent integration of large language models (LLMs) and vision-language models (VLMs). We first formally introduce the Aerial VLN problem and define two interaction paradigms, single-instruction and dialog-based, as foundational axes. We then organize the body of Aerial VLN methods into a taxonomy of five architectural categories: sequence-to-sequence and attention-based methods, end-to-end LLM/VLM methods, hierarchical methods, multi-agent methods, and dialog-based navigation methods. For each category, we systematically analyze design rationales, technical trade-offs, and reported performance. We critically assess the evaluation infrastructure for Aerial VLN, including datasets, simulation platforms, and metrics, and identify its gaps in scale, environmental diversity, real-world grounding, and metric coverage. We consolidate cross-method comparisons on shared benchmarks and analyze key architectural trade-offs, including discrete versus continuous actions, end-to-end versus hierarchical designs, and the simulation-to-reality gap. Finally, we synthesize seven concrete open problems (long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation) and propose specific research directions grounded in the evidence presented throughout the survey.

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0
Influential citations: 0
References: 156
Year: 2026
Venue: N/A
