Mar 19, 2026 · arXiv:2603.19039

TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota

AI Summary

TerraScope, a new vision-language model, is introduced to tackle pixel-grounded geospatial reasoning in Earth Observation by adaptively fusing optical and SAR modalities and integrating temporal sequences for change analysis. To facilitate training and evaluation, the authors curate Terra-CoT, a large-scale dataset with 1 million samples containing pixel-level masks, and TerraScope-Bench, a benchmark with six sub-tasks that evaluates both answer accuracy and mask quality. Experiments demonstrate that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning and provides interpretable visual evidence.

Key Contribution

TerraScope is a vision-language model that performs pixel-grounded geospatial reasoning by adaptively fusing multi-modal (optical and SAR) and multi-temporal Earth observation data.

Abstract

Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
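The abstract says TerraScope-Bench scores both answer accuracy and mask quality, but it does not name the mask metric. The following is a minimal sketch of what such joint scoring could look like, assuming intersection-over-union (IoU) as the mask-quality measure and case-insensitive exact match for answers. The function names `mask_iou` and `score_sample` are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence, Union

# A binary mask as rows of 0/1 values (assumed representation, not the paper's).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two same-shaped binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Two empty masks agree perfectly; this also avoids dividing by zero.
    return 1.0 if union == 0 else inter / union


def score_sample(
    pred_answer: str, gt_answer: str, pred_mask: Mask, gt_mask: Mask
) -> Dict[str, Union[bool, float]]:
    """Score one sample on both axes: answer correctness and mask overlap."""
    # Assumption: answers are compared by case-insensitive exact match.
    correct = pred_answer.strip().lower() == gt_answer.strip().lower()
    return {"answer_correct": correct, "mask_iou": mask_iou(pred_mask, gt_mask)}


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    print(score_sample(" B ", "b", pred, gt))
```

Reporting the two scores separately, rather than collapsing them into one number, makes it visible when a model reaches the right answer without grounding it in the right pixels.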

Computer Vision · Multimodal Models · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 59

Year: 2026

Venue: N/A
