Portsmouth AI and Data Science Centre (PAIDS) | UW-Madison | Mar 2, 2026 | arXiv:2603.01576

Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications

Saurabh Kaushik, Lalit Maurya, Beth Tellman

AI Summary

The paper introduces Cryo-Bench, a new benchmark dataset for evaluating Geo-Foundation Models (GFMs) on cryosphere-related tasks including debris-covered glaciers, glacial lakes, sea ice, and calving fronts. The authors evaluated 14 GFMs alongside UNet and ViT baselines, finding that UNet with a frozen encoder achieves the highest average mIoU overall, while GFMs like DOFA and TerraMind excel in few-shot settings. Fine-tuning GFMs with optimized learning rates substantially improves their performance, demonstrating their domain adaptation capabilities despite limited cryosphere data in pre-training.

Key Contribution

Despite limited cryosphere data in their pretraining, Geo-Foundation Models show surprisingly strong domain adaptation for tasks like glacial lake and sea ice segmentation.

Abstract

Geo-Foundation Models (GFMs) have been evaluated across diverse Earth observation tasks spanning multiple domains and have demonstrated strong potential for producing reliable maps even with sparse labels. However, benchmarking of GFMs for Cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce Cryo-Bench, a benchmark compiled to evaluate GFM performance across key Cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of 66.38, followed by TerraMind at 64.02, across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10% input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of 59.53, 56.62, and 56.60, respectively, compared to UNet's 56.60. When fully fine-tuning GFMs, we observe inconsistent performance across datasets and models. However, tuning the learning rate along with fine-tuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of 12.77%. Despite having minimal Cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, and frozen encoders when users need quick results without extensive experimentation (GitHub: https://github.com/Sk-2103/Cryo-Bench).
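
The abstract reports results as mIoU and as a relative improvement. As a minimal sketch of how such numbers are commonly computed, the Python below averages per-class intersection-over-union on flattened label maps and expresses a score change relative to a baseline. The function names `mean_iou` and `relative_improvement` are hypothetical, and the conventions used here (skipping classes absent from both prediction and target, scoring a single label map rather than accumulating over a dataset) are assumptions; the Cryo-Bench code may define the metrics differently.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union in percent for flattened label maps.

    Assumption: classes that appear in neither pred nor target are skipped
    rather than counted as 0 or 1.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either map says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return 100.0 * sum(ious) / len(ious)


def relative_improvement(baseline: float, tuned: float) -> float:
    """Percentage change of a tuned score relative to a baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (tuned - baseline) / baseline


if __name__ == "__main__":
    # Toy binary segmentation: 6 pixels, classes 0 and 1 (illustrative values only).
    pred = [0, 0, 1, 1, 1, 0]
    target = [0, 1, 1, 1, 0, 0]
    print(mean_iou(pred, target, num_classes=2))
    print(relative_improvement(50.0, 60.0))
```

In the toy example each class has an intersection of 2 pixels and a union of 4, so both per-class IoUs are 0.5. The inputs are illustrative and are not taken from the paper.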

Computer Vision | Eval Frameworks & Benchmarks | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 34

Year: 2026

Venue: N/A
