PolyU | May 21, 2026 | arXiv:2605.22645

AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

Hanjun Luo, Zhimu Huang, Sylvia Chung, Yingbin Jin, Jialin Li, Jiang Li, Xinfeng Li, Hanan Salam

AI Summary

AtelierEval is introduced as the first benchmark to evaluate the prompting proficiency of both humans and MLLMs in text-to-image generation, using 360 expert-crafted tasks that span three cognitively grounded task categories. To enable scalable evaluation, the authors propose AtelierJudge, a skill-based agentic evaluator that reaches a Spearman correlation of 0.79 with human expert judgment. Experiments show that mimicry outperforms planning as a prompting strategy, and the authors advocate for image-augmented prompters.

Key Contribution

The experiments suggest that an LLM writes better text-to-image prompts when it mimics existing images than when it plans from scratch.

Abstract

Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.
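The abstract reports agreement between AtelierJudge and human experts as a Spearman correlation. As a rough illustration of what that statistic measures, the sketch below computes Spearman's rho as the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the score lists are hypothetical and made up for illustration; they are not the paper's code or data, and the abstract does not say how ties or aggregation were handled.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for illustration only; these are NOT results from the paper.
judge_scores = [4.0, 2.5, 3.0, 5.0, 1.0]
human_scores = [3.5, 3.0, 2.0, 4.5, 1.5]
rho = spearman(judge_scores, human_scores)
```

Because the statistic depends only on rank order, an automated judge can score on a different scale than the human raters and still correlate strongly, as long as it orders the prompt-image pairs similarly.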

Eval Frameworks & Benchmarks, Multimodal Models, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
