The paper introduces a Dual Loop Data Cleaning (DLDC) method to automatically generate high-quality remote sensing image-text training data by leveraging contrastive multimodal quality evaluations. DLDC uses an external generation loop (EGL), based on a multimodal foundation model, for layout description and an internal evaluation loop (IEL), based on contrastive learning metrics, to assess image-text matching. Fine-tuning T2I models with the cleaned dataset yields significant improvements in image generation quality, as evidenced by substantial reductions in FID, increases in CLIP and RemoteCLIP scores, and improved downstream segmentation performance.
Forget expensive human annotation: this dual-loop method automatically cleans remote sensing image-text datasets, boosting T2I model performance by over 35%.
Text-to-image (T2I) generation, offering flexible and intuitive synthetic data for downstream geoscience applications, has garnered increasing attention in recent years. Training a good T2I model requires high-quality, large-scale image–text datasets. However, obtaining such datasets in remote sensing (RS) is challenging because of high annotation costs and the domain-specific knowledge required. This study proposes a dual loop data cleaning (DLDC) method, which leverages contrastive multimodal quality evaluations to generate high-quality RS image–text training data automatically. By constructing an external generation loop (EGL) based on a multimodal foundation model and an internal evaluation loop (IEL) based on contrastive learning metrics, DLDC can automatically generate layout descriptions and evaluate the image–text matching degree of satellite images. The proposed approach effectively filters out noisy samples and curates a refined dataset without human intervention. Experimental results show that our dual loop evaluation can accurately determine the optimal data cleaning ratio for different scenes, improving image generation quality. Compared with the pretrained T2I models, our fine-tuned models reduce Fréchet Inception Distance values by over 35%, increase CLIP scores by more than 25%, and improve RemoteCLIP scores by over 10.5%. Furthermore, our DLDC method achieves superior performance compared to other state-of-the-art RS T2I models (e.g., Crs-diff, GeoRSSD, DiffusionSAT). Our data-cleaning method also improves downstream segmentation, yielding gains of 8.14% in mean IoU and 7.5% in mean accuracy over the same model trained on raw, uncleaned data. Experimental results demonstrate that our automatically generated image–text data is of similar quality to manually annotated data, opening new pathways for rapid, cost-effective, and reliable RS data generation.
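The core filtering idea described above can be sketched as score-based pruning at a chosen cleaning ratio. The snippet below is an illustrative simplification, not the paper's implementation: it assumes image–text matching scores (e.g., CLIP or RemoteCLIP cosine similarities) have already been computed, and simply discards the lowest-scoring fraction of pairs.

```python
# Illustrative sketch (not the authors' code): prune image-text pairs
# by a precomputed contrastive matching score, keeping the top
# (1 - cleaning_ratio) fraction, in the spirit of DLDC's internal
# evaluation loop. Scores below stand in for CLIP/RemoteCLIP similarities.

def clean_dataset(pairs, scores, cleaning_ratio):
    """Return the highest-scoring (1 - cleaning_ratio) fraction of pairs.

    pairs: list of (image_id, caption) tuples
    scores: image-text matching scores, one per pair
    cleaning_ratio: fraction of the noisiest samples to discard
    """
    ranked = sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True)
    n_keep = int(len(ranked) * (1.0 - cleaning_ratio))
    return [pair for pair, _ in ranked[:n_keep]]

# Hypothetical example: four RS image-caption pairs with toy scores.
pairs = [("img_a", "dense urban blocks"), ("img_b", "river delta"),
         ("img_c", "clouds only"), ("img_d", "farmland grid")]
scores = [0.31, 0.28, 0.05, 0.26]
kept = clean_dataset(pairs, scores, cleaning_ratio=0.25)
# The mismatched pair ("img_c", "clouds only") is filtered out.
```

In the paper's pipeline, the cleaning ratio itself is not fixed by hand but determined per scene by the dual loop evaluation; the sketch only shows the pruning step once a ratio is chosen.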