Tencent AI · May 6, 2026 · arXiv:2605.04503

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

Yuancheng Wei, Linli Yao, Jiali Chen, Yiting Lu, Duojun Huang, Zhao Zhong

AI Summary

The authors introduce DiffCap-Bench, a new benchmark for Image Difference Captioning (IDC) designed to address limitations in existing datasets regarding diversity, compositional complexity, and evaluation metrics. They propose an LLM-as-a-Judge evaluation protocol using human-validated Difference Lists to improve the assessment of models' ability to capture and describe visual changes. Experiments using DiffCap-Bench reveal performance gaps between proprietary and open-source MLLMs, emphasize the importance of reasoning, and highlight limitations in model scaling for IDC tasks.

Key Contribution

Existing image difference captioning benchmarks lack diversity and compositional complexity, and their lexical-overlap metrics neither capture semantic consistency nor penalize hallucinations. DiffCap-Bench offers a robust alternative whose scores align with human expert judgments and correlate with downstream utility for image editing data construction.

Abstract

Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data construction. However, existing benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (e.g., BLEU, METEOR) fail to capture semantic consistency or penalize hallucinations, which together prevent a comprehensive and robust evaluation of multimodal large language models (MLLMs) on IDC. To address these gaps, we introduce DiffCap-Bench, a comprehensive IDC benchmark covering ten distinct difference categories to ensure diversity and compositional complexity. Furthermore, we propose an LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists, enabling a robust assessment of models' ability to both capture and describe visual changes. Through extensive evaluation of state-of-the-art MLLMs, we reveal significant performance gaps between proprietary and open-source models, highlight the critical importance of reasoning capability, and identify clear limitations in model scaling. Our framework also demonstrates strong alignment with human expert judgments and strong correlation with downstream image editing data construction quality. These findings establish DiffCap-Bench as both a reliable IDC evaluation framework and a practical predictor of downstream utility. The benchmark and code will be made publicly available to support further research.
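The abstract does not spell out how judge decisions are turned into scores, so the following is only a minimal sketch of how an evaluation grounded in a Difference List could be aggregated. It assumes a judge has already decided, for each listed difference, whether the caption describes it, and has counted the caption's claims that match nothing in the list. The `JudgeVerdict` fields, the `score_caption` function, and the two reported quantities are hypothetical illustrations, not the paper's actual schema or metrics.

```python
from dataclasses import dataclass


@dataclass
class JudgeVerdict:
    """Hypothetical judge output for one caption (not the paper's schema)."""
    matched: list[bool]      # one flag per Difference List item: did the caption describe it?
    unsupported_claims: int  # claimed differences that match no Difference List item
    total_claims: int        # all differences the caption claims


def score_caption(verdict: JudgeVerdict) -> dict[str, float]:
    """Aggregate per-item judge decisions into two illustrative scores."""
    n_items = len(verdict.matched)
    # Share of ground-truth differences that the caption captured.
    recall = sum(verdict.matched) / n_items if n_items else 0.0
    # Share of the caption's claims with no support in the Difference List.
    hallucination_rate = (
        verdict.unsupported_claims / verdict.total_claims
        if verdict.total_claims
        else 0.0
    )
    return {"recall": recall, "hallucination_rate": hallucination_rate}


# Assumed example: 3 of 4 listed differences captured, 1 of 4 claims unsupported.
example = JudgeVerdict(
    matched=[True, True, False, True],
    unsupported_claims=1,
    total_claims=4,
)
print(score_caption(example))  # {'recall': 0.75, 'hallucination_rate': 0.25}
```

Reporting the two quantities separately reflects the point made above: a lexical-overlap score such as BLEU cannot distinguish a caption that misses a change from one that invents a change, whereas an item-level comparison against a Difference List can.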

Computer Vision · Eval Frameworks & Benchmarks · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
