University of Trento; INSAIT, Sofia University “St. Kliment Ohridski”
https://github.com/shuyansy/EarthMind

Abstract

Large Multimodal Models (LMMs) have demonstrated strong performance in various vision-language tasks. However, they often struggle to comprehensively understand Earth Observation (EO) data, which is critical for monitoring the environment and the effects of human activity on it. In this work, we present EarthMind, a novel vision-language framework for multi-granular and multi-sensor EO data understanding. EarthMind features two core components: (1) Spatial Attention Prompting (SAP), which reallocates attention within the LLM to enhance pixel-level understanding; and (2) Cross-modal Fusion, which aligns heterogeneous modalities into a shared space and adaptively reweighs tokens based on their information density for effective fusion. To facilitate multi-sensor fusion evaluation, we propose EarthMind-Bench, a comprehensive benchmark with over 2,000 human-annotated multi-sensor image-question pairs, covering a wide range of perception and reasoning tasks. Extensive experiments demonstrate the effectiveness of EarthMind. It achieves state-of-the-art performance on EarthMind-Bench, surpassing GPT-4o despite being only 4B in scale. Moreover, EarthMind outperforms existing methods on multiple public EO benchmarks, showcasing its potential to handle both multi-granular and multi-sensor challenges in a unified framework.

Figure 1: The proposed EarthMind supports unified multi-granular understanding for Earth Observation (EO) imagery, including image-level, region-level, and pixel-level tasks. In addition, it enables complementary multi-sensor fusion across optical and SAR modalities.

1 Introduction

Large multimodal models (LMMs), which integrate large language models (LLMs) [1, 2] with visual encoders, have shown remarkable success across a variety of vision-language tasks, including image captioning [3, 4], visual question answering [5, 6, 7], and grounding [8, 9, 10]. Among these application domains, Earth Observation (EO) [11, 12, 13] is of particular importance, as it enables monitoring of the Earth and of the effects of human activity on it. However, LMMs trained on general-purpose images often struggle to generalize to EO data due to a significant domain gap. Recent work has addressed this challenge by constructing large-scale instruction-tuning datasets [14, 15, 16, 17, 18] specifically tailored to EO, enabling better adaptation of LMMs to this domain.

Despite these advances, existing LMMs remain limited in their understanding of EO data. First, EO tasks span multiple levels of granularity, from pixel-level segmentation [19, 20], through region-level semantic understanding [21, 22], up to image-level scene classification [23, 24]. Second, EO data comprise multiple sensing modalities, including optical imagery (e.g., RGB and multispectral) and Synthetic Aperture Radar (SAR) [25, 26]. These modalities are inherently complementary: optical images provide rich texture and spectral information under favorable conditions, while SAR captures structural details regardless of weather or illumination. Although a variety of sensor types is available, effective fusion, particularly between SAR and optical data, remains a key challenge for EO understanding. As summarized in Tab. 1, fine-grained, multi-sensor comprehension in EO remains largely unresolved.

To tackle these challenges, we introduce EarthMind, the first LMM capable of fusing multi-sensor EO inputs and performing reasoning across multiple semantic levels, as shown in Fig. 1. It achieves this by projecting heterogeneous features from different sensors and scales into a unified semantic space, enabling effective interpretation by the LLM. The novelty of EarthMind lies in two key design concepts that enable spatial grounding and cross-modal understanding of EO data. First, Spatial Attention Prompting (SAP) enhances pixel-level grounding by explicitly extracting attention and reallocating it to regions aligned with the queried objects. This overcomes a limitation of prior approaches [9, 10] that combine segmentation foundation models [27, 28] with LLMs but degrade in EO settings due to vague boundaries and scale imbalances. Second, a Cross-modal Fusion mechanism, built on token-level contrastive learning, guides the integration of complementary modalities (e.g., RGB and SAR) into a unified semantic space. Equipped with Modality Mutual Attention, EarthMind adaptively selects the most informative features from each modality, thereby facilitating robust autoregressive learning within the LLM.
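The fusion design is only outlined at this level of detail. As a rough sketch of how a Modality Mutual Attention block with information-density-based reweighting could be realized, the PyTorch-style code below projects both modalities into a shared space, lets each attend to the other, and gates how much cross-modal context each token absorbs. The class name, the gating MLP used as a proxy for information density, and all shapes are illustrative assumptions, and the token-level contrastive alignment loss used during training is omitted.

```python
# Minimal sketch of a cross-modal fusion block with "modality mutual attention"
# and per-token reweighting, as described at a high level in the text. Module
# names, shapes, and the gating scheme are illustrative assumptions, not the
# authors' released implementation.
import torch
import torch.nn as nn


class ModalityMutualFusion(nn.Module):
    def __init__(self, opt_dim: int, sar_dim: int, dim: int = 1024, heads: int = 8):
        super().__init__()
        # Project heterogeneous sensor features into a shared semantic space.
        self.proj_opt = nn.Linear(opt_dim, dim)
        self.proj_sar = nn.Linear(sar_dim, dim)
        # Mutual attention: each modality attends to the other one.
        self.opt_to_sar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sar_to_opt = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Per-token gate standing in for "information density" reweighting.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, opt_tokens: torch.Tensor, sar_tokens: torch.Tensor):
        # opt_tokens: (B, N, opt_dim), sar_tokens: (B, M, sar_dim)
        o = self.proj_opt(opt_tokens)
        s = self.proj_sar(sar_tokens)
        # Cross-modal context for every optical and SAR token.
        o_ctx, _ = self.opt_to_sar(o, s, s)
        s_ctx, _ = self.sar_to_opt(s, o, o)
        # The gate decides, per token, how much cross-modal context to absorb.
        o_fused = o + self.gate(torch.cat([o, o_ctx], dim=-1)) * o_ctx
        s_fused = s + self.gate(torch.cat([s, s_ctx], dim=-1)) * s_ctx
        # Fused visual tokens handed to the LLM for autoregressive decoding.
        return torch.cat([o_fused, s_fused], dim=1)


if __name__ == "__main__":
    fusion = ModalityMutualFusion(opt_dim=1152, sar_dim=1152)
    rgb = torch.randn(1, 256, 1152)   # optical tokens
    sar = torch.randn(1, 256, 1152)   # SAR tokens
    print(fusion(rgb, sar).shape)     # torch.Size([1, 512, 1024])
```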
Additionally, we propose EarthMind-Bench, a new benchmark designed to evaluate LMMs in challenging EO scenarios. As shown in Tab. 1, EarthMind-Bench offers several unique features. First, it encompasses multi-granular tasks, ranging from coarse-grained image understanding to fine-grained segmentation. Second, it introduces multi-sensor data, in particular paired RGB-SAR imagery, enabling evaluation of cross-modal fusion capabilities. Third, it covers multi-level questions, spanning low-level perception as well as high-level reasoning. In total, EarthMind-Bench consists of more than 2,000 multiple-choice and open-ended questions, providing a comprehensive benchmark for assessing the ability of LMMs to interpret and reason over EO data. We implemented EarthMind based on Qwen-2.5-
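As a rough illustration of the kind of item described above, one EarthMind-Bench-style sample could be organized along the following lines; the field names and example values are hypothetical and do not reflect the released annotation schema.

```python
# Hypothetical layout of one EarthMind-Bench-style sample (illustrative only;
# field names and values are not the released dataset schema).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EOBenchSample:
    rgb_path: str                    # optical image
    sar_path: str                    # paired SAR image of the same scene
    task: str                        # e.g. "image", "region", or "pixel" level
    level: str                       # "perception" or "reasoning"
    question: str
    options: Optional[List[str]]     # present for multiple-choice items
    answer: str                      # option letter or free-form reference
    mask_path: Optional[str] = None  # reference mask for segmentation items


sample = EOBenchSample(
    rgb_path="scene_0001_rgb.png",
    sar_path="scene_0001_sar.png",
    task="image",
    level="reasoning",
    question="Which land-cover change is most likely visible in this scene?",
    options=["A. Deforestation", "B. Urban expansion", "C. Flooding", "D. No change"],
    answer="B",
)
```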