ImagenWorld is introduced as a benchmark comprising 3.6K diverse condition sets across six image generation and editing tasks and six topical domains, designed to stress-test image synthesis models. The benchmark incorporates 20K human annotations and an explainable evaluation schema that identifies localized object-level and segment-level errors, complementing VLM-based metrics. An evaluation of 14 models shows that models struggle most with editing tasks and with symbolic and text-heavy domains, and that closed-source systems perform best overall. VLM-based metrics show promise but lack fine-grained error attribution.
Image generation models ace artworks and photorealistic images but still choke on screenshots and infographics, highlighting a critical gap in real-world applicability.
Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more in editing tasks than in generation tasks, especially in local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; and (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
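The abstract does not define "Kendall accuracy". One common reading is pairwise ranking agreement: the fraction of item pairs that an automated metric orders the same way as human raters. The sketch below illustrates that reading only. The function name `kendall_pairwise_accuracy`, the tie handling, and the toy scores are assumptions made for illustration, not the paper's implementation or data.

```python
from itertools import combinations


def kendall_pairwise_accuracy(human_scores, metric_scores):
    """Fraction of item pairs the metric orders the same way as humans.

    Assumed conventions (not taken from the paper):
    - pairs tied in the human scores are skipped;
    - a metric tie on a human-untied pair counts as a disagreement.
    """
    if len(human_scores) != len(metric_scores):
        raise ValueError("score lists must have the same length")

    agree = 0
    total = 0
    for i, j in combinations(range(len(human_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        metric_diff = metric_scores[i] - metric_scores[j]
        if human_diff == 0:
            continue  # no human preference for this pair
        total += 1
        if human_diff * metric_diff > 0:
            agree += 1  # same sign: both rank the pair the same way

    if total == 0:
        raise ValueError("no untied human pairs to compare")
    return agree / total


# Toy scores for four hypothetical outputs (made up for illustration).
human = [4, 3, 2, 1]
metric = [0.9, 0.7, 0.8, 0.1]
accuracy = kendall_pairwise_accuracy(human, metric)
print(round(accuracy, 3))
```

Under this reading, a value of 1.0 means the metric reproduces every human pairwise preference and 0.5 is roughly what a random ordering achieves, which puts a reported score of 0.79 in context.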