Mar 30, 2026arXiv:2603.27982

CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

Kesheng Chen, Yamin Hu, Qi Zhou, Zhenqian Zhu, Wenjian Luo

AI Summary

The paper introduces CDH-Bench, a new benchmark to evaluate commonsense-driven hallucination in VLMs, where models override visual evidence with commonsense priors. The benchmark covers counting, relational, and attribute anomalies, testing models' ability to follow visual evidence against commonsense. Experiments on frontier VLMs reveal vulnerabilities to prior-driven normalization, quantified by metrics like Counterfactual Accuracy Drop and Commonsense Collapse Rate.

Key Contribution

Even state-of-the-art vision-language models still struggle to reconcile visual evidence with commonsense, often hallucinating based on prior knowledge instead of what they actually see.

Abstract

Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, do models follow what is shown or what commonsense suggests? A characteristic failure in this setting is that the model overrides visual evidence and outputs the commonsense alternative. We term this phenomenon \textbf{commonsense-driven hallucination} (CDH). To evaluate it, we introduce \textbf{CDH-Bench}, a benchmark designed to create explicit \textbf{visual evidence--commonsense conflicts}. CDH-Bench covers three dimensions: \textit{counting anomalies}, \textit{relational anomalies}, and \textit{attribute anomalies}. We evaluate frontier VLMs under \textit{binary Question Answering (QA)} and \textit{multiple-choice QA}, and report metrics including \textit{Counterfactual Accuracy} (CF-Acc), \textit{Commonsense Accuracy} (CS-Acc), \textit{Counterfactual Accuracy Drop} (CFAD), \textit{Commonsense Collapse Rate} (CCR), and \textit{Relative Prior Dependency} (RPD). Results show that even strong models remain vulnerable to prior-driven normalization under visual evidence--commonsense conflict. CDH-Bench provides a controlled diagnostic of visual fidelity under visual evidence--commonsense conflict.

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

Related Papers