This paper introduces Earth-OneVision, a 2B-parameter remote sensing multimodal large language model (RS-MLLM) that integrates six sensor modalities and supports nine task categories within a unified autoregressive framework. Three mechanisms, Full-Granularity Vision-Language Alignment, Spatial-Linguistic Isomorphic Serialization, and Progressive Cross-Modality Adaptation, target vision-language alignment, unified spatial outputs, and the cross-modality domain gap, respectively. Earth-OneVision achieves competitive or state-of-the-art results on multiple benchmarks, matching or outperforming RS-MLLMs with 4B to 72B parameters.
Earth-OneVision unifies six sensor modalities in a single 2B-parameter model that matches or outperforms larger counterparts.
RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B-parameter RS-MLLM that unifies six sensor modalities (optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across nine task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint gap and the imaging-physics gap in turn. To support joint training, the MMRS-OneVision dataset is constructed with ~34M QA pairs covering the same six sensor modalities, cross-sensor fusion, and nine task categories, substantially exceeding existing RS multimodal instruction datasets in scale. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG test set for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.
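For readers unfamiliar with the visual grounding metric quoted above, P@0.5 is commonly computed as the fraction of predicted boxes whose intersection-over-union (IoU) with the ground-truth box reaches 0.5. The following is a minimal sketch of that common definition, not the paper's evaluation code. It assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form, one prediction per query, and a `>=` comparison; the function names `iou` and `precision_at_iou` are hypothetical, and the exact OPT-RSVG protocol may differ.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), axis-aligned.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Overlap extent along each axis, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    is at least `threshold` (assumed convention: >=)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) >= threshold)
    return hits / len(preds)


# Toy usage: the first prediction matches exactly, the second misses entirely.
example_preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
example_gts = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
example_score = precision_at_iou(example_preds, example_gts)  # 1 hit of 2
```

Under this definition, a reported 87.52% P@0.5 would mean that roughly seven in eight predicted boxes overlap their target with an IoU of at least 0.5.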