NYU, Princeton | Apr 10, 2026 | arXiv:2604.09531

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu, Zhuang Liu

AI Summary

VisionFoundry is a pipeline that uses LLMs to generate synthetic VQA datasets tailored to specific visual perception tasks, taking only the task name as input. The pipeline generates questions, answers, and text-to-image (T2I) prompts, synthesizes images with T2I models, and verifies consistency with a proprietary VLM. Training VLMs on the resulting VisionFoundry-10K dataset improves results on visual perception benchmarks (+7% on MMVP and +10% on CV-Bench-3D), supporting the effectiveness of targeted synthetic supervision.

Key Contribution

Training on just 10,000 synthetic image-question-answer triples, generated automatically from task names alone, improves VLMs by +7% on MMVP and +10% on CV-Bench-3D.

Abstract

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
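The generate, synthesize, and verify flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Sample`, `build_dataset`, `propose`, `render`, and `verify` are hypothetical, and the LLM, T2I model, and verifier VLM are replaced by stub functions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical; the abstract does not specify an API.


@dataclass
class Sample:
    """One image-question-answer triple plus the prompt that produced the image."""
    task: str
    question: str
    answer: str
    t2i_prompt: str
    image: bytes


# (task name, count) -> list of (question, answer, T2I prompt); stands in for the LLM
Proposer = Callable[[str, int], List[Tuple[str, str, str]]]
# T2I prompt -> image bytes; stands in for the text-to-image model
Renderer = Callable[[str], bytes]
# (image, question, answer) -> True if the image is consistent with the answer
Verifier = Callable[[bytes, str, str], bool]


def build_dataset(task: str, n: int, propose: Proposer,
                  render: Renderer, verify: Verifier) -> List[Sample]:
    """Generate candidates from a task name and keep only verified triples."""
    kept: List[Sample] = []
    for question, answer, prompt in propose(task, n):
        image = render(prompt)
        # Drop samples whose image does not support the intended answer.
        if verify(image, question, answer):
            kept.append(Sample(task, question, answer, prompt, image))
    return kept


# Stub components for demonstration only (assumptions, not real models).
def fake_propose(task: str, n: int) -> List[Tuple[str, str, str]]:
    return [(f"{task} question {i}", "A" if i % 2 == 0 else "B", f"prompt {i}")
            for i in range(n)]


def fake_render(prompt: str) -> bytes:
    return prompt.encode("utf-8")


def fake_verify(image: bytes, question: str, answer: str) -> bool:
    # Pretend the verifier only accepts samples whose answer is "A".
    return answer == "A"


dataset = build_dataset("Depth Order", 4, fake_propose, fake_render, fake_verify)
```

Passing the three components in as callables keeps the filtering logic independent of any particular LLM, T2I model, or verifier, which the abstract leaves unspecified.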

Computer Vision | Data Curation & Synthetic Data | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
