The authors introduce BigEarthNet.txt, a large-scale, multi-sensor remote sensing image-text dataset comprising co-registered Sentinel-1 and Sentinel-2 imagery paired with diverse textual annotations. The dataset includes geographically anchored captions, visual question answering pairs, and referring expression detection instructions, totaling 464,044 image patches and 9.6M text annotations. Benchmarking experiments reveal the limitations of existing VLMs on complex land-use/land-cover classification tasks in remote sensing, while fine-tuning on BigEarthNet.txt yields consistent performance improvements.
VLMs struggle with Earth observation tasks involving complex land use, but a new dataset with nearly 10 million text annotations could change that.
Vision-language models (VLMs) have shown strong performance in computer vision (CV), yet their performance on remote sensing (RS) data remains limited due to the lack of large-scale, multi-sensor RS image-text datasets with diverse textual annotations. Existing datasets predominantly include aerial Red-Green-Blue imagery with short or weakly grounded captions, and provide limited diversity in annotation types. To address this limitation, we introduce BigEarthNet.txt, a large-scale, multi-sensor image-text dataset designed to advance instruction-driven image-text learning in Earth observation across multiple tasks. BigEarthNet.txt contains 464,044 co-registered Sentinel-1 synthetic aperture radar and Sentinel-2 multispectral images with 9.6M text annotations, including: i) geographically anchored captions describing land-use/land-cover (LULC) classes, their spatial relations, and environmental context; ii) visual question answering pairs relevant for different tasks; and iii) referring expression detection instructions for bounding box prediction. Through a comparative statistical analysis, we demonstrate that BigEarthNet.txt surpasses existing RS image-text datasets in textual richness and annotation type variety. We further establish a manually verified benchmark split to evaluate VLMs in RS and CV. The results show the limitations of these models on tasks that involve complex LULC classes, whereas fine-tuning on BigEarthNet.txt yields consistent performance gains across all considered tasks.