The paper introduces OSMDA, a self-contained domain adaptation framework for remote sensing VLMs that eliminates the need for large teacher models and manual annotations. OSMDA leverages the base VLM's OCR and chart-comprehension abilities to generate captions by pairing aerial images with rendered OpenStreetMap (OSM) tiles, using OSM's metadata to enrich the captions. Fine-tuning the VLM on the generated corpus, using satellite imagery alone, yields state-of-the-art performance on 10 benchmarks while being more cost-effective than teacher-dependent methods.
Forget expensive labeled data: this VLM learns to "read" OpenStreetMap data to caption satellite images, achieving state-of-the-art remote sensing performance at a fraction of the cost.
Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage the model's optical character recognition and chart-comprehension capabilities to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 image-text-to-text benchmarks and 9 competitive baselines. When our generated data is equally mixed with real data, our method achieves state-of-the-art results while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.
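The pairing step, matching each aerial image's footprint to the rendered OSM tile covering the same area, is not detailed in the abstract. A minimal sketch using the standard slippy-map tiling scheme is shown below; the function names and the use of the public OSM tile server are illustrative assumptions, not the authors' pipeline:

```python
import math

def deg2tile(lat_deg: float, lon_deg: float, zoom: int) -> tuple[int, int]:
    """Convert WGS84 coordinates to slippy-map (OSM) tile indices at a zoom level.

    Uses the standard Web Mercator tiling formulas: longitude maps linearly
    to x, latitude maps to y via the inverse Gudermannian (asinh of tan).
    """
    lat_rad = math.radians(lat_deg)
    n = 2 ** zoom  # tiles per axis at this zoom level
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def osm_tile_url(lat_deg: float, lon_deg: float, zoom: int) -> str:
    """URL of the rendered OSM raster tile covering the given point
    (illustrative; a production pipeline would render tiles locally)."""
    x, y = deg2tile(lat_deg, lon_deg, zoom)
    return f"https://tile.openstreetmap.org/{zoom}/{x}/{y}.png"
```

Given an aerial image with known center coordinates, the corresponding tile image can then be placed alongside it in the VLM's context so the model can read the rendered street names and map symbology when writing the caption.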