Mar 15, 2026 · arXiv:2603.14363

AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control

Peng Xu, Zhengnan Deng, Jiayan Deng, Zonghua Gu, Shaohua Wan

AI Summary

AerialVLA, a novel end-to-end vision-language-action framework, is introduced for UAV navigation, directly mapping visual inputs and linguistic instructions to continuous control signals. It uses a streamlined dual-view perception strategy and a fuzzy directional prompting mechanism derived from onboard sensors to eliminate reliance on dense oracle guidance. Experiments on the TravelUAV benchmark demonstrate state-of-the-art performance in seen environments and superior generalization in unseen environments, achieving nearly three times the success rate of leading baselines.

Key Contribution

Ditch the complex modules: a minimalist, end-to-end vision-language-action model for UAV navigation maps visual inputs and language directly to continuous control, and achieves nearly three times the success rate of leading baselines in unseen environments.

Abstract

Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on dense oracle guidance. Ultimately, we formulate a unified control space that integrates continuous 3-Degree-of-Freedom (3-DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance in seen environments. Furthermore, it exhibits superior generalization in unseen scenarios by achieving nearly three times the success rate of leading baselines, validating that a minimalist, autonomy-centric paradigm captures more robust visual-motor representations than complex modular systems.
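The abstract does not specify how the fuzzy directional prompt or the unified control space is implemented. The following is only a minimal illustrative sketch, under stated assumptions, of what those two ideas could look like. Everything here is hypothetical and not taken from the paper: the 8-way direction vocabulary, the choice of forward/vertical/yaw as the three degrees of freedom, the names `fuzzy_direction`, `UavAction`, and `decode_action`, and the landing threshold.

```python
import math
from dataclasses import dataclass

# Hypothetical coarse direction vocabulary (counter-clockwise from "ahead").
# The paper's actual prompt wording is not given in the abstract.
DIRECTIONS = [
    "ahead", "ahead-left", "left", "behind-left",
    "behind", "behind-right", "right", "ahead-right",
]


def fuzzy_direction(dx: float, dy: float, yaw_rad: float) -> str:
    """Quantize the bearing to a target into one of 8 coarse phrases.

    Assumes a frame where +x is forward at yaw 0, +y is to the left,
    and yaw increases counter-clockwise. (dx, dy) is the target offset
    from the UAV, as might be estimated from onboard sensors.
    """
    bearing = math.atan2(dy, dx) - yaw_rad     # relative bearing in radians
    sector = round(bearing / (math.pi / 4))    # nearest 45-degree sector
    return DIRECTIONS[sector % 8]              # modulo handles wrap-around


@dataclass
class UavAction:
    """Assumed unified action: three continuous commands plus a landing flag."""
    forward: float   # assumed forward motion command
    vertical: float  # assumed vertical motion command
    yaw: float       # assumed heading change command
    land: bool       # intrinsic landing signal


def decode_action(raw, land_threshold: float = 0.5) -> UavAction:
    """Turn a hypothetical 4-value model output into a UavAction."""
    forward, vertical, yaw, land_score = raw
    return UavAction(forward, vertical, yaw, land_score >= land_threshold)


if __name__ == "__main__":
    print(fuzzy_direction(0.0, 1.0, 0.0))
    print(decode_action([1.0, 0.0, 0.1, 0.9]))
```

The point of the sketch is the design idea described in the abstract: the prompt carries only a coarse direction rather than dense waypoint guidance, and the landing decision is part of the same action output rather than delegated to an external object detector.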

Computer Vision · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
