The paper introduces EarthMind, a unified vision-language framework for Earth Observation (EO) data analysis that handles both single- and cross-sensor inputs, using a hierarchical cross-modal attention (HCA) mechanism to fuse optical and SAR data. To facilitate cross-sensor learning, the authors curated FusionEO, a 30K-pair dataset, and EarthMind-Bench, a 2,841-pair benchmark with expert annotations. Experiments demonstrate that EarthMind achieves state-of-the-art performance on EarthMind-Bench and outperforms existing MLLMs on multiple EO benchmarks, highlighting the benefits of cross-sensor fusion for EO tasks.
EarthMind demonstrates that hierarchical cross-modal attention across optical and SAR data significantly boosts MLLM performance on Earth Observation tasks, outperforming models limited to single-sensor inputs.
Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs, overlooking the complementarity across heterogeneous modalities. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs via an innovative hierarchical cross-modal attention (HCA) design. Specifically, HCA hierarchically captures visual relationships across sensors and aligns them with language queries, enabling adaptive fusion of optical and Synthetic Aperture Radar (SAR) features. To support cross-sensor learning, we curate FusionEO, a 30K-pair dataset with diverse annotations, and establish EarthMind-Bench, a 2,841-pair benchmark with expert annotations for perception and reasoning tasks. Extensive experiments show that EarthMind achieves state-of-the-art results on EarthMind-Bench and surpasses existing MLLMs on multiple EO benchmarks.
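The abstract describes HCA only at a high level: cross-sensor attention followed by alignment with the language query. The sketch below is a minimal PyTorch illustration of what one such fusion block could look like; it is not the paper's released implementation, and all names (HCABlock, the gating layer) and tensor shapes are assumptions made for illustration.

```python
# A minimal sketch of one hierarchical cross-modal attention (HCA) block.
# Hypothetical reconstruction from the abstract: optical tokens attend to
# SAR tokens, a learned gate mixes the two sensors adaptively, and the
# fused tokens are then aligned with language-query tokens.
import torch
import torch.nn as nn


class HCABlock(nn.Module):
    """One hierarchy level: fuse optical and SAR tokens, align with the query."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-sensor attention: optical tokens query the SAR tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Query alignment: fused visual tokens attend to language-query tokens.
        self.query_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Gate that adaptively weighs how much SAR context enters the fusion.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, optical, sar, query):
        # optical: (B, N_opt, D), sar: (B, N_sar, D), query: (B, N_txt, D)
        sar_ctx, _ = self.cross_attn(self.norm1(optical), sar, sar)
        g = self.gate(torch.cat([optical, sar_ctx], dim=-1))
        fused = optical + g * sar_ctx               # adaptive optical-SAR fusion
        aligned, _ = self.query_attn(self.norm2(fused), query, query)
        return fused + aligned                      # query-aligned fused tokens


# Usage: one block per feature-pyramid level gives the "hierarchical" part,
# e.g. coarse-to-fine levels sharing the same interface.
blocks = nn.ModuleList([HCABlock(dim=768) for _ in range(3)])
opt = torch.randn(2, 196, 768)   # optical tokens
sar = torch.randn(2, 196, 768)   # SAR tokens
txt = torch.randn(2, 32, 768)    # language-query tokens
for blk in blocks:
    opt = blk(opt, sar, txt)     # refined, query-aligned fused features
```

The gated residual here is one plausible reading of "adaptive fusion": when SAR adds little for a query, the gate can suppress it, and the block degrades gracefully toward single-sensor (optical-only) behavior.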