This paper introduces CREval, a QA-based evaluation pipeline and benchmark (CREval-Bench) for assessing instruction-based image manipulation, addressing the limitations of existing MLLM-based scoring methods. CREval-Bench comprises over 800 editing samples across nine creative dimensions, enabling systematic evaluation via 13K QA pairs. Experiments using CREval reveal that while closed-source models outperform open-source models, both struggle with complex and creative edits, and the automated metrics align well with human judgment.
Current image editing models, even closed-source ones, still fall short on complex and creative instruction-based tasks, as revealed by a new interpretable QA-based evaluation framework.
Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Model (MLLM) scoring. Alongside the pipeline, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open- and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval's automated metrics and human judgments. CREval therefore provides a reliable foundation for evaluating image editing models on complex and creative manipulation tasks, and highlights key challenges and opportunities for future research.
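To make the QA-based scoring idea concrete, the following is a minimal sketch of how per-question verdicts could be aggregated into interpretable per-dimension and overall scores. All names, the data format, and the stubbed answer function are hypothetical illustrations, not CREval's actual implementation; in the real pipeline the QA pairs are generated automatically and answered by an MLLM inspecting the edited image.

```python
# Hypothetical sketch of a QA-based evaluation aggregator, in the spirit
# of an interpretable pipeline like CREval. Not the paper's actual code.
from collections import defaultdict


def score_edit(qa_pairs, answer_fn):
    """Aggregate yes/no QA verdicts into per-dimension and overall scores.

    qa_pairs: list of dicts with 'question', 'expected', and 'dimension'
    keys (an assumed format for illustration).
    answer_fn: callable(question) -> answer string; stands in for an MLLM
    that answers each question by looking at the edited image.
    """
    per_dim = defaultdict(list)
    for qa in qa_pairs:
        correct = answer_fn(qa["question"]).strip().lower() == qa["expected"]
        per_dim[qa["dimension"]].append(correct)
    # Accuracy per creative dimension keeps the score interpretable:
    # a low dimension score points to the specific failure mode.
    dim_scores = {d: sum(v) / len(v) for d, v in per_dim.items()}
    overall = sum(dim_scores.values()) / len(dim_scores)
    return overall, dim_scores


# Toy usage with a stubbed answerer in place of a real MLLM.
qa = [
    {"question": "Is the sky now purple?", "expected": "yes", "dimension": "color"},
    {"question": "Is the cat still present?", "expected": "yes", "dimension": "preservation"},
    {"question": "Was a hat added to the dog?", "expected": "yes", "dimension": "addition"},
]
stub = lambda q: "yes" if ("sky" in q or "cat" in q) else "no"
overall, dims = score_edit(qa, stub)
```

Because each score decomposes into individual QA verdicts, a disagreement with human judgment can be traced back to a single question rather than an opaque scalar rating.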