This paper introduces a scalable framework for evaluating the realism of synthetic image augmentations, specifically environmental conditions (fog, rain, snow, and nighttime) added to car-mounted camera images. The authors compare rule-based augmentation libraries against generative AI image-editing models, using a vision-language model (VLM) jury to assess perceptual realism and embedding-based distributional analysis to measure similarity to real adverse-condition imagery. Generative AI substantially outperforms rule-based methods and, as judged by the VLM jury, matches or exceeds the acceptance rate of real adverse-condition images for most conditions.
Generative AI can now add adverse conditions such as fog, rain, snow, and nighttime to images realistically enough that a vision-language model jury accepts them about as often as real adverse-condition photos for most conditions.
Evaluation of AI systems often requires synthetic test cases, particularly for rare or safety-critical conditions that are difficult to observe in operational data. Generative AI offers a promising approach for producing such data through controllable image editing, but its usefulness depends on whether the resulting images are sufficiently realistic to support meaningful evaluation. We present a scalable framework for assessing the realism of synthetic image-editing methods and apply it to the task of adding environmental conditions (fog, rain, snow, and nighttime) to car-mounted camera images. Using 40 clear-day images, we compare rule-based augmentation libraries with generative AI image-editing models. Realism is evaluated using two complementary automated metrics: a vision-language model (VLM) jury for perceptual realism assessment, and embedding-based distributional analysis to measure similarity to genuine adverse-condition imagery. Generative AI methods substantially outperform rule-based approaches, with the best generative method achieving approximately 3.6 times the acceptance rate of the best rule-based method. Performance varies across conditions: fog proves easiest to simulate, while nighttime transformations remain challenging. Notably, the VLM jury assigns imperfect acceptance even to real adverse-condition imagery, establishing practical ceilings against which synthetic methods can be judged. By this standard, leading generative methods match or exceed real-image performance for most conditions. These results suggest that modern generative image-editing models can enable scalable generation of realistic adverse-condition imagery for evaluation pipelines. Our framework therefore provides a practical approach for scalable realism evaluation, though validation against human studies remains an important direction for future work.
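To make the acceptance-rate comparison concrete, here is a minimal Python sketch of how jury votes could be turned into a per-method acceptance rate and compared against the real-image ceiling. The abstract does not say how the jury's votes are aggregated, so the majority-vote rule, the function names, and all vote data below are assumptions chosen for illustration; none of the numbers are results from the paper.

```python
def acceptance_rate(votes_per_image):
    """Fraction of images accepted as realistic by a majority of jurors.

    votes_per_image: one inner list per image, one boolean per juror
    (True means the juror judged the image realistic).
    Majority voting is an assumed aggregation rule, not the paper's.
    """
    accepted = 0
    for votes in votes_per_image:
        # An image counts as accepted if more than half the jurors say True.
        if sum(votes) / len(votes) > 0.5:
            accepted += 1
    return accepted / len(votes_per_image)


def relative_to(rate, baseline_rate):
    """Ratio of one acceptance rate to another (e.g. method vs. real images)."""
    return rate / baseline_rate


# Hypothetical votes from a three-member jury on four images per source.
generative_votes = [
    [True, True, False],
    [True, True, True],
    [False, True, True],
    [True, False, False],
]
rule_based_votes = [
    [False, False, True],
    [False, False, False],
    [True, True, False],
    [False, False, False],
]
real_votes = [
    [True, True, True],
    [True, True, False],
    [True, False, False],
    [True, True, True],
]

gen_rate = acceptance_rate(generative_votes)
rule_rate = acceptance_rate(rule_based_votes)
real_rate = acceptance_rate(real_votes)  # the practical ceiling

print(f"generative: {gen_rate:.2f}, rule-based: {rule_rate:.2f}, real: {real_rate:.2f}")
print(f"generative vs. rule-based: {relative_to(gen_rate, rule_rate):.1f}x")
print(f"generative vs. real ceiling: {relative_to(gen_rate, real_rate):.2f}")
```

In this toy data the real images are themselves accepted only 75% of the time, which illustrates why the abstract treats real-image acceptance as the ceiling: a synthetic method is judged against that rate rather than against 100%.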