Chang Guang Satellite Technology | Jilin | Mar 10, 2026 | arXiv:2603.09471

OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks

Ronghao Fu, Haoran Liu, Weijie Zhang, Zhiwen Lin, Xiao Yang, Peng Zhang, Bo Yang

AI Summary

The authors introduce OmniEarth, a benchmark for evaluating vision-language models (VLMs) on 28 fine-grained geospatial tasks spanning perception, reasoning, and robustness. It comprises 9,275 quality-controlled images, some of them proprietary Jilin-1 (JL-1) satellite imagery, and 44,210 manually verified instructions, filling a gap in the systematic evaluation of remote sensing VLMs. Evaluations of contrastive learning-based models, general closed-source and open-source VLMs, and remote sensing VLMs show that existing models still struggle with geospatially complex tasks.

Key Contribution

VLMs still struggle to understand our planet, as revealed by a new geospatial benchmark spanning diverse Earth observation tasks and multi-source sensing data.

Abstract

Vision-Language Models (VLMs) have demonstrated effective perception and reasoning capabilities on general-domain tasks, leading to growing interest in their application to Earth observation. However, a systematic benchmark for comprehensively evaluating remote sensing vision-language models (RSVLMs) remains lacking. To address this gap, we introduce OmniEarth, a benchmark for evaluating RSVLMs under realistic Earth observation scenarios. OmniEarth organizes tasks along three capability dimensions: perception, reasoning, and robustness. It defines 28 fine-grained tasks covering multi-source sensing data and diverse geospatial contexts. The benchmark supports two task formulations: multiple-choice VQA and open-ended VQA. The latter includes pure text outputs for captioning tasks, bounding box outputs for visual grounding tasks, and mask outputs for segmentation tasks. To reduce linguistic bias and examine whether model predictions rely on visual evidence, OmniEarth adopts a blind test protocol and a quintuple semantic consistency requirement. OmniEarth includes 9,275 carefully quality-controlled images, including proprietary satellite imagery from Jilin-1 (JL-1), along with 44,210 manually verified instructions. We conduct a systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, as well as RSVLMs. Results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that need to be addressed for remote sensing applications. OmniEarth is publicly available at https://huggingface.co/datasets/sjeeudd/OmniEarth.
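The abstract names three evaluation ingredients without defining them here: bounding box outputs for visual grounding, a blind test protocol, and a quintuple semantic consistency requirement. The sketch below shows one plausible way such scores could be computed. It is an illustration under stated assumptions, not the authors' scoring code: it assumes a question group counts as passed only when all five of its variants are answered correctly, that the blind test measures how often a question is answered correctly without the image, and that grounding is scored by intersection over union (IoU) on (x_min, y_min, x_max, y_max) boxes. All function names and the record layout are hypothetical.

```python
from collections import defaultdict


def strict_consistency_accuracy(records, group_size=5):
    """Fraction of question groups in which every variant is answered correctly.

    records: iterable of (group_id, is_correct) pairs, one per question variant.
    ASSUMPTION: a group passes only if it has exactly `group_size` variants
    and all of them are correct. The paper may define the requirement differently.
    """
    groups = defaultdict(list)
    for group_id, is_correct in records:
        groups[group_id].append(bool(is_correct))
    if not groups:
        return 0.0
    passed = sum(
        1 for flags in groups.values()
        if len(flags) == group_size and all(flags)
    )
    return passed / len(groups)


def blind_solvable_rate(blind_records):
    """Fraction of questions answered correctly when the image is withheld.

    blind_records: iterable of (question_id, is_correct) pairs from a run
    in which the model sees only the text. ASSUMPTION: a high rate signals
    that the question can be solved from linguistic priors alone.
    """
    flags = [bool(is_correct) for _, is_correct in blind_records]
    return sum(flags) / len(flags) if flags else 0.0


def box_iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two hypothetical groups: q1 is fully correct, q2 misses one variant.
    demo = [("q1", True)] * 5 + [("q2", True)] * 4 + [("q2", False)]
    print(strict_consistency_accuracy(demo))
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Under this strict reading, a model that answers four of five variants correctly earns no credit for that group, which penalizes answers that depend on phrasing rather than on visual evidence.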

Computer Vision | Eval Frameworks & Benchmarks | Multimodal Models

Year: 2026

Venue: N/A
