Mar 10, 2026 | arXiv:2603.09625

Grounding Synthetic Data Generation With Vision and Language Models

AI Summary

The paper introduces a vision-language grounded framework for generating and evaluating synthetic data for remote sensing tasks, combining generative models, semantic segmentation, image captioning, and vision-language models. The authors create ARAS400k, a large-scale remote sensing dataset of 100k real and 300k synthetic images, and use it to evaluate synthetic data quality in terms of semantic composition, caption redundancy, and cross-modal consistency. Experiments show that models trained on augmented data (real + synthetic) outperform those trained only on real data in both semantic segmentation and image captioning.

Key Contribution

Synthetic data evaluated through a vision-language grounded framework improves remote sensing segmentation and captioning: models trained on real plus synthetic data outperform models trained on real data alone.

Abstract

Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large-scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at zenodo.org/records/18890661 and the code base at github.com/caglarmert/ARAS400k.
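The three evaluation ideas named in the abstract (semantic composition, caption redundancy, cross-modal consistency) can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the 0.3 coverage threshold, the word-matching consistency check, and the Jaccard overlap measure are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical label set; the actual ARAS400k classes are not listed here.
CLASS_NAMES = {0: "water", 1: "forest", 2: "building"}


def semantic_composition(mask):
    """Return the fraction of pixels per class name for a 2D label mask."""
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    return {CLASS_NAMES[label]: n / total for label, n in counts.items()}


def caption_mentions_dominant(caption, composition, min_share=0.3):
    """Assumed consistency check: every class covering at least
    min_share of the image must appear as a word in the caption."""
    cleaned = caption.lower().replace(",", " ").replace(".", " ")
    words = set(cleaned.split())
    dominant = [name for name, share in composition.items() if share >= min_share]
    return all(name in words for name in dominant)


def caption_overlap(a, b):
    """Assumed redundancy measure: Jaccard similarity of caption word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


# Toy 4x4 segmentation map: mostly water and forest, one building pixel.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [0, 0, 1, 1],
]
composition = semantic_composition(mask)
consistent = caption_mentions_dominant(
    "A forest next to a large body of water.", composition
)
inconsistent = caption_mentions_dominant(
    "Dense buildings along a road.", composition
)
```

In this toy case the first caption names both dominant classes and passes, while the second fails because it mentions neither. A real pipeline would more plausibly rely on a vision-language model than on exact word matching, so treat the check above only as a stand-in for the idea of comparing visual structure against language.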

Computer Vision | Data Curation & Synthetic Data | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
