The paper introduces OnFly, a real-time, fully onboard framework for zero-shot aerial vision-language navigation (AVLN) that addresses instability, unreliable progress monitoring, and the safety-efficiency trade-off in existing methods. OnFly uses a dual-agent architecture for decoupled target generation and progress monitoring, a hybrid memory for trajectory context, and a semantic-geometric verifier to refine VLM-predicted targets. Experiments demonstrate a significant improvement in task success rate in simulation (from 26.4% to 67.8%), and fully onboard real-world flights validate its feasibility for real-time deployment.
Achieve 2.5x higher success in UAV navigation by decoupling target generation from progress monitoring, enabling safer and more efficient zero-shot flight.
Aerial vision-language navigation (AVLN) enables UAVs to follow natural-language instructions in complex 3D environments. However, existing zero-shot AVLN methods often suffer from unstable single-stream Vision-Language Model decision-making, unreliable long-horizon progress monitoring, and a trade-off between safety and efficiency. We propose OnFly, a fully onboard, real-time framework for zero-shot AVLN. OnFly adopts a shared-perception dual-agent architecture that decouples high-frequency target generation from low-frequency progress monitoring, thereby stabilizing decision-making. It further employs a hybrid keyframe-recent-frame memory to preserve global trajectory context while maintaining KV-cache prefix stability, enabling reliable long-horizon monitoring with termination and recovery signals. In addition, a semantic-geometric verifier refines VLM-predicted targets for instruction consistency and geometric safety using VLM features and depth cues, while a receding-horizon planner generates optimized collision-free trajectories under geometric safety constraints, improving both safety and efficiency. In simulation, OnFly improves task success from 26.4% to 67.8% over the strongest state-of-the-art baseline, while fully onboard real-world flights validate its feasibility for real-time deployment. The code will be released at https://github.com/Robotics-STAR-Lab/OnFly.
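The shared-perception dual-agent idea in the abstract can be illustrated with a minimal sketch: a fast agent proposes targets on every perception tick, while a slow agent audits long-horizon progress on a sparser schedule, both reading the same frames. All class names, rates, and the monitoring interval below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a shared-perception dual-agent loop: high-frequency
# target generation decoupled from low-frequency progress monitoring.
# Names and the monitoring interval are assumptions for illustration.
from dataclasses import dataclass, field


@dataclass
class SharedPerception:
    """Stand-in for the onboard camera feeding both agents the same frames."""
    frame_id: int = 0

    def capture(self) -> int:
        self.frame_id += 1
        return self.frame_id


@dataclass
class DualAgentNavigator:
    monitor_every: int = 5  # low-frequency monitoring interval (assumed)
    targets: list = field(default_factory=list)
    progress_checks: list = field(default_factory=list)

    def step(self, frame: int) -> None:
        # High-frequency agent: propose a navigation target every frame.
        self.targets.append(f"target@{frame}")
        # Low-frequency agent: monitor progress only on a sparse schedule,
        # which is where termination/recovery signals would be raised.
        if frame % self.monitor_every == 0:
            self.progress_checks.append(f"check@{frame}")


perception = SharedPerception()
nav = DualAgentNavigator()
for _ in range(10):
    nav.step(perception.capture())

# 10 frames yield 10 target proposals but only 2 progress checks.
print(len(nav.targets), len(nav.progress_checks))
```

Decoupling the two loops means a slow or noisy monitoring step cannot stall or destabilize the fast target-generation stream, which is the stability argument the abstract makes.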