The paper introduces EarthMind, a unified vision-language framework for Earth Observation (EO) data analysis that handles both single- and cross-sensor inputs, using a hierarchical cross-modal attention (HCA) mechanism to fuse optical and SAR data. To facilitate cross-sensor learning, the authors curated FusionEO, a 30K-pair dataset, and EarthMind-Bench, a 2,841-pair benchmark with expert annotations. Experiments demonstrate that EarthMind achieves state-of-the-art performance on EarthMind-Bench and outperforms existing MLLMs on multiple EO benchmarks, highlighting the benefits of cross-sensor fusion for EO tasks.
EarthMind demonstrates that hierarchical cross-modal attention across optical and SAR data significantly boosts MLLM performance on Earth Observation tasks, outperforming models limited to single-sensor inputs.
Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs, overlooking the complementarity across heterogeneous modalities. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs via an innovative hierarchical cross-modal attention (HCA) design. Specifically, HCA hierarchically captures visual relationships across sensors and aligns them with language queries, enabling adaptive fusion of optical and Synthetic Aperture Radar (SAR) features. To support cross-sensor learning, we curate FusionEO, a 30K-pair dataset with diverse annotations, and establish EarthMind-Bench, a 2,841-pair benchmark with expert annotations for perception and reasoning tasks. Extensive experiments show that EarthMind achieves state-of-the-art results on EarthMind-Bench and surpasses existing MLLMs on multiple EO benchmarks.
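The abstract describes HCA only at a high level: cross-sensor attention followed by alignment with the language query. The sketch below is a minimal PyTorch illustration of what one such fusion block could look like; it is not the paper's released implementation, and all names (HCABlock, the gating layer) and tensor shapes are assumptions made for illustration.

```python
# A minimal sketch of one hierarchical cross-modal attention (HCA) block.
# Hypothetical reconstruction from the abstract: optical tokens attend to
# SAR tokens, a learned gate mixes the two sensors adaptively, and the
# fused tokens are then aligned with language-query tokens.
import torch
import torch.nn as nn


class HCABlock(nn.Module):
    """One hierarchy level: fuse optical and SAR tokens, align with the query."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-sensor attention: optical tokens query the SAR tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Query alignment: fused visual tokens attend to language-query tokens.
        self.query_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Gate that adaptively weighs how much SAR context enters the fusion.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, optical, sar, query):
        # optical: (B, N_opt, D), sar: (B, N_sar, D), query: (B, N_txt, D)
        sar_ctx, _ = self.cross_attn(self.norm1(optical), sar, sar)
        g = self.gate(torch.cat([optical, sar_ctx], dim=-1))
        fused = optical + g * sar_ctx               # adaptive optical-SAR fusion
        aligned, _ = self.query_attn(self.norm2(fused), query, query)
        return fused + aligned                      # query-aligned fused tokens


# Usage: one block per feature-pyramid level gives the "hierarchical" part,
# e.g. coarse-to-fine levels sharing the same interface.
blocks = nn.ModuleList([HCABlock(dim=768) for _ in range(3)])
opt = torch.randn(2, 196, 768)   # optical tokens
sar = torch.randn(2, 196, 768)   # SAR tokens
txt = torch.randn(2, 32, 768)    # language-query tokens
for blk in blocks:
    opt = blk(opt, sar, txt)     # refined, query-aligned fused features
```

The gated residual here is one plausible reading of "adaptive fusion": when SAR adds little for a query, the gate can suppress it, and the block degrades gracefully toward single-sensor (optical-only) behavior.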