This paper introduces PaintBench, a scalable, procedurally generated benchmark that evaluates 20 precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Across 11 image editing models, performance is low: the best model reaches only 17.1% mean Intersection over Union (mIoU), with geometric transformations, most structural manipulations, and formula-based color changes proving especially difficult. PaintBench scores also correlate strongly with performance on an applied data visualization editing benchmark, TinyGrafixBench ($R^2 = 0.91$, $p<0.001$), suggesting that the benchmark can measure and guide progress in precise multimodal visual editing.
Existing image editing models struggle with precision: the best reaches only 17.1% mIoU on a new benchmark designed to evaluate fundamental visual editing tasks.
While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores ($R^2 = 0.91$, $p<0.001$). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
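To make the headline metric concrete, here is a minimal sketch of a deterministic, pixel-level mIoU computation. It is not PaintBench's actual scoring code: the abstract does not say how masks are derived from model outputs, so the sketch assumes each sample has already been reduced to a pair of same-shaped binary masks (predicted edit region and ground-truth edit region). The names `iou` and `mean_iou`, and the convention of scoring two empty masks as 1.0, are illustrative assumptions.

```python
from typing import Iterable, Sequence, Tuple

# A mask is a 2D grid of truthy/falsy values (assumed representation).
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union of two same-shaped binary masks."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            intersection += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))

    # Assumption: two empty masks count as perfect agreement.
    return 1.0 if union == 0 else intersection / union


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, target) pairs."""
    scores = [iou(pred, target) for pred, target in pairs]
    if not scores:
        raise ValueError("need at least one (prediction, target) pair")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [1, 0]]
    # One overlapping pixel out of three covered pixels.
    print(f"IoU:  {iou(pred, target):.3f}")
    print(f"mIoU: {mean_iou([(pred, target), (target, target)]):.3f}")
```

Because the score is a fixed function of pixel values, the same inputs always give the same result, which is the property the abstract contrasts with judge-model evaluation.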