TU Darmstadt | Mar 3, 2026 | arXiv:2603.02681

VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

Jinxiang Lai, Zexin Lu, Jiajun He, Rongwei Quan, Wenzhe Zhao, Qinyu Yang, Qi Chen, Qin Lin, Chuyue Li, Tao Gao, Yuhao Shan, Song Guo, Qinglin Lu

AI Summary

The paper introduces VisionCreator, a visual-generation agentic model designed to unify Understanding, Thinking, Planning, and Creation (UTPC) within an end-to-end learnable framework. To train this model, the authors created VisGenData-4k, a dataset of high-quality creation trajectories generated by a metacognition-based VisionAgent. VisionCreator is optimized using Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) and demonstrates superior performance compared to larger closed-source models on the newly introduced VisGenBench benchmark.

Key Contribution

VisionCreator, an 8B/32B agent, beats larger closed-source models at visual content creation by unifying understanding, thinking, planning, and creation within a single end-to-end framework.

Abstract

Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows, capabilities that are challenging for general models, while workflow-based agents lack specialized knowledge for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work makes four key contributions: (i) VisGenData-4k and its construction methodology, which uses a metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) the VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) VisionCreator-8B/32B models that demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.

Computer Vision, Multimodal Models, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
