The paper introduces VisionCreator, a visual-generation agentic model designed to unify Understanding, Thinking, Planning, and Creation (UTPC) within an end-to-end learnable framework. To train this model, the authors created VisGenData-4k, a dataset of high-quality creation trajectories generated by a metacognition-based VisionAgent. VisionCreator is optimized using Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) and demonstrates superior performance compared to larger closed-source models on the newly introduced VisGenBench benchmark.
VisionCreator, an 8B/32B agent, beats larger closed-source models at visual content creation by unifying understanding, thinking, planning, and creation within a single end-to-end framework.
Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows. These capabilities are challenging for general models, while workflow-based agents lack the specialized knowledge needed for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work makes four key contributions: (i) VisGenData-4k and its construction methodology, which uses a metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) the VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) VisionCreator-8B/32B models that outperform larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.