This paper proposes a five-level taxonomy for visual generation models, ranging from simple "Atomic Generation" to sophisticated "World-Modeling Generation," to highlight the field's evolution beyond appearance synthesis. It identifies key technical drivers like flow matching and unified understanding-and-generation models. The authors argue that current evaluations are insufficient and propose a capability-centered approach using benchmark reviews, stress tests, and expert-constrained case studies to better assess structural, temporal, and causal reasoning in generated visuals.
Today's visual generation models excel at photorealism but still fail at the kind of spatial reasoning, long-term consistency, and causal understanding that truly intelligent visual generation demands.
Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy (Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation) that progresses from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.