The paper introduces a dual-teacher contrastive distillation framework that improves multispectral Earth Observation (EO) foundation models by transferring knowledge from two teachers: a multispectral model and an optical vision foundation model (VFM). The method aligns the student's pretraining objective with the contrastive self-distillation paradigm used in optical VFMs, enabling coherent cross-modal representation learning. Experiments show state-of-the-art performance on diverse optical and multispectral benchmarks, with average gains of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification, supporting contrastive distillation as a scalable approach to representation learning in EO.
Multispectral Earth observation models can learn more effectively from optical vision foundation models through dual-teacher distillation, improving semantic segmentation by an average of 3.64 percentage points.
Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student's pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources. Code: Coming soon.
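The abstract describes the training objective only at a high level. As a hedged illustration of what a DINO-style "contrastive self-distillation" loss with two teachers could look like, here is a minimal sketch. It is not the authors' code, which is listed as coming soon. Everything in it is an assumption: the student is given two hypothetical projection heads that produce K prototype scores, one matched to the multispectral teacher and one matched to the optical VFM teacher (e.g. on an RGB view). The temperatures, the centering, and the equal 0.5 weighting are borrowed from common DINO defaults purely for illustration, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

# Hypothetical sketch of a DINO-style dual-teacher self-distillation loss.
# Names, temperatures, and the loss weighting are illustrative assumptions,
# not the paper's implementation.

def log_softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Temperature-scaled log-softmax, computed with the log-sum-exp trick."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]

def softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax derived from the stable log-softmax."""
    return [math.exp(v) for v in log_softmax(logits, temperature)]

def distill_ce(teacher_logits: Sequence[float], student_logits: Sequence[float],
               center: Sequence[float], t_teacher: float = 0.04,
               t_student: float = 0.1) -> float:
    """Cross-entropy H(p_teacher, p_student) over K prototype scores.

    The teacher output is centered and sharpened with a low temperature and
    serves as a fixed target (in training, no gradient would flow into it).
    """
    p_t = softmax([t - c for t, c in zip(teacher_logits, center)], t_teacher)
    log_p_s = log_softmax(student_logits, t_student)
    return -sum(pt * lps for pt, lps in zip(p_t, log_p_s))

def update_center(center: Sequence[float], teacher_batch: Sequence[Sequence[float]],
                  momentum: float = 0.9) -> List[float]:
    """EMA of the batch-mean teacher logits, used in DINO to discourage collapse."""
    n = len(teacher_batch)
    batch_mean = [sum(col) / n for col in zip(*teacher_batch)]
    return [momentum * c + (1 - momentum) * b for c, b in zip(center, batch_mean)]

def dual_teacher_loss(student_ms_logits, ms_teacher_logits, center_ms,
                      student_opt_logits, vfm_teacher_logits, center_opt,
                      weight_ms: float = 0.5) -> float:
    """Weighted sum of two distillation terms: one against the multispectral
    teacher and one against the optical VFM teacher (hypothetical weighting)."""
    loss_ms = distill_ce(ms_teacher_logits, student_ms_logits, center_ms)
    loss_opt = distill_ce(vfm_teacher_logits, student_opt_logits, center_opt)
    return weight_ms * loss_ms + (1 - weight_ms) * loss_opt

# Usage: a student whose heads agree with both teachers scores far lower
# than one whose heads disagree with them.
zeros = [0.0, 0.0, 0.0]
aligned = dual_teacher_loss([5.0, 0.0, 0.0], [1.0, 0.0, 0.0], zeros,
                            [0.0, 5.0, 0.0], [0.0, 1.0, 0.0], zeros)
misaligned = dual_teacher_loss([0.0, 5.0, 0.0], [1.0, 0.0, 0.0], zeros,
                               [5.0, 0.0, 0.0], [0.0, 1.0, 0.0], zeros)
print(aligned, misaligned)
```

Centering combined with a sharper teacher temperature is the standard DINO mechanism against representation collapse. Averaging the two teacher terms is simply the most neutral choice when the paper's relative weighting is not stated.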