The paper introduces FlightGPT, a UAV Vision-and-Language Navigation (VLN) framework that leverages Vision-Language Models (VLMs) to improve multimodal fusion, generalization, and interpretability. FlightGPT uses a two-stage training pipeline: Supervised Fine-Tuning (SFT) on high-quality demonstrations, followed by Group Relative Policy Optimization (GRPO) guided by a composite reward. Experiments on the CityNav dataset show that FlightGPT achieves state-of-the-art performance, surpassing the strongest baseline by 9.22% in success rate in unseen environments.
FlightGPT pushes past prior UAV navigation limits, delivering a 9.22% success-rate boost in unseen environments by pairing vision-language models with chain-of-thought reasoning.
Unmanned Aerial Vehicle (UAV) Vision-and-Language Navigation (VLN) is vital for applications such as disaster response, logistics delivery, and urban inspection. However, existing methods often struggle with insufficient multimodal fusion, weak generalization, and poor interpretability. To address these challenges, we propose FlightGPT, a novel UAV VLN framework built upon Vision-Language Models (VLMs) with powerful multimodal perception capabilities. We design a two-stage training pipeline: first, Supervised Fine-Tuning (SFT) using high-quality demonstrations to improve initialization and structured reasoning; then, the Group Relative Policy Optimization (GRPO) algorithm, guided by a composite reward that considers goal accuracy, reasoning quality, and format compliance, to enhance generalization and adaptability. Furthermore, FlightGPT introduces a Chain-of-Thought (CoT)-based reasoning mechanism to improve decision interpretability. Extensive experiments on the city-scale dataset CityNav demonstrate that FlightGPT achieves state-of-the-art performance across all scenarios, with a 9.22% higher success rate than the strongest baseline in unseen environments. Our implementation is publicly available.
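The abstract describes a composite reward combining goal accuracy, reasoning quality, and format compliance, which GRPO then uses by comparing rewards within a group of sampled rollouts. A minimal sketch of that idea is shown below; the function names, weights, and the weighted-sum form are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a composite reward and GRPO-style group
# normalization, as described at a high level in the abstract.
# All weights and function names are assumptions for illustration.

def composite_reward(goal_accuracy: float,
                     reasoning_quality: float,
                     format_ok: bool,
                     w_goal: float = 0.6,
                     w_reason: float = 0.3,
                     w_format: float = 0.1) -> float:
    """Weighted sum of the three reward terms (weights are illustrative)."""
    return (w_goal * goal_accuracy
            + w_reason * reasoning_quality
            + w_format * (1.0 if format_ok else 0.0))

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each rollout's reward against the
    mean and standard deviation of its sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    std = std or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]
```

In this sketch, rollouts whose composite reward exceeds the group mean receive positive advantages and are reinforced; the per-term weights would in practice be tuned to balance goal accuracy against reasoning and format terms.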