AtelierEval is introduced as the first benchmark to evaluate the prompting proficiency of both humans and MLLMs in text-to-image generation, using 360 expert-crafted tasks spanning three cognitive task categories. To enable scalable evaluation, the authors propose AtelierJudge, a skill-based agentic evaluator that reaches a Spearman correlation of 0.79 with human expert judgment. Experiments show that mimicry outperforms planning as a prompting strategy, which the authors take as support for image-augmented prompters.
Turns out, the best way to get an MLLM to write good text-to-image prompts is to have it mimic existing images rather than plan from scratch.
Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.
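For readers unfamiliar with the agreement metric quoted above, the following is a minimal sketch of how a Spearman rank correlation between an automatic judge and human raters can be computed. It is not the authors' evaluation code: the function names `average_ranks` and `spearman` and the two score lists are assumptions made up for illustration, and the numbers are not data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five prompt-image pairs (illustrative only).
judge_scores = [4.0, 2.5, 3.0, 5.0, 1.0]
human_scores = [3.5, 3.0, 2.0, 5.0, 1.5]

rho = spearman(judge_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```

Because the metric depends only on rank order, it rewards a judge that orders prompt-image pairs the way human experts do, even if the two use differently calibrated score scales.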