Mar 9, 2026 · arXiv:2603.08063

Enhancing Cross-View UAV Geolocalization via LVLM-Driven Relational Modeling

Bowen Liu, Pengyue Jia, Wanyu Wang, Derong Xu, Jiawei Cheng, Jiancheng Dong, Xiao Han, Zimo Zhao, Chao Zhang, Fangyu Hong, Xiangyu Zhao

AI Summary

This paper introduces a plug-and-play ranking architecture for cross-view UAV geolocalization that uses a Large Vision-Language Model (LVLM) to model the relational dependencies between UAV and satellite imagery jointly, rather than extracting features from each view independently. The LVLM learns visual-semantic correlations between the two views, and a relational-aware loss function with soft labels avoids over-penalizing near-positive matches, improving training stability and discriminative power. Experiments across several baseline architectures and standard benchmarks show that the method substantially improves the retrieval accuracy of the existing models it is applied to.

Key Contribution

Forget independent feature extraction: a new architecture uses LVLMs to explicitly model the relationships between drone and satellite imagery, substantially boosting geolocalization accuracy.

Abstract

The primary objective of cross-view UAV geolocalization is to identify the exact spatial coordinates of drone-captured imagery by aligning it with extensive, geo-referenced satellite databases. Current approaches typically extract features independently from each perspective and rely on basic heuristics to compute similarity, thereby failing to explicitly capture the essential interactions between different views. To address this limitation, we introduce a novel, plug-and-play ranking architecture designed to explicitly perform joint relational modeling for improved UAV-to-satellite image matching. By harnessing the capabilities of a Large Vision-Language Model (LVLM), our framework effectively learns the deep visual-semantic correlations linking UAV and satellite imagery. Furthermore, we present a novel relational-aware loss function to optimize the training phase. By employing soft labels, this loss provides fine-grained supervision that avoids overly penalizing near-positive matches, ultimately boosting both the model's discriminative power and training stability. Comprehensive evaluations across various baseline architectures and standard benchmarks reveal that the proposed method substantially boosts the retrieval accuracy of existing models, yielding superior performance even under highly demanding conditions.
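The abstract does not give the form of the relational-aware loss, so the following is only a minimal sketch of the general idea of soft-label supervision for ranking, not the authors' formulation. It assumes (hypothetically) that soft labels are derived from each candidate satellite tile's geographic distance to the true location, and that the loss is a cross-entropy between those labels and a softmax over the ranker's scores. The function names and the `temperature_m` parameter are illustrative and do not come from the paper.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def soft_labels(distances_m, temperature_m=50.0):
    """Assumed labelling rule: closer candidates get more target mass.

    distances_m: distance in metres from each candidate tile to the
    ground-truth location (0.0 for the exact match).
    temperature_m: hypothetical smoothing scale; larger values spread
    the target mass over more distant candidates.
    """
    return [math.exp(lp) for lp in log_softmax([-d / temperature_m for d in distances_m])]


def soft_label_ranking_loss(scores, targets):
    """Cross-entropy between soft targets and the softmax of ranker scores."""
    log_probs = log_softmax(scores)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


if __name__ == "__main__":
    # Three candidates: exact match, a tile 10 m away, a tile 500 m away.
    targets = soft_labels([0.0, 10.0, 500.0])
    # Ranking the near-positive second is penalized less than
    # ranking the distant negative second.
    print(soft_label_ranking_loss([3.0, 2.0, -1.0], targets))
    print(soft_label_ranking_loss([3.0, -1.0, 2.0], targets))
```

With one-hot labels, the 10 m tile would be treated exactly like the 500 m tile; under the soft targets above, a ranker that scores the near-positive highly incurs a smaller loss, which is the behaviour the abstract attributes to soft labels.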

Computer Vision · Multimodal Models · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
