VLM-SubtleBench, a new benchmark, is introduced to evaluate VLMs on their ability to perform subtle comparative reasoning across ten difference types and diverse domains like industrial, aerial, and medical imaging. The benchmark reveals systematic performance gaps between VLMs and humans, especially in nuanced reasoning tasks. Controlled analyses pinpoint specific areas where VLMs' reasoning capabilities significantly degrade.
VLMs still struggle with subtle visual differences, exhibiting a significant gap compared to human-level comparative reasoning across diverse domains like medical and industrial imaging.
The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types - Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action - and curates paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.
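For readers sketching a similar evaluation, per-difference-type accuracy over paired question-image sets could be tallied as below. This is a minimal illustration, not the benchmark's released code: the record schema (`diff_type`, `prediction`, `answer`) and the `score_by_type` helper are assumptions for the sake of the example; only the ten difference-type names come from the abstract.

```python
from collections import defaultdict

# The ten difference types named in the VLM-SubtleBench abstract.
DIFFERENCE_TYPES = [
    "Attribute", "State", "Emotion", "Temporal", "Spatial",
    "Existence", "Quantity", "Quality", "Viewpoint", "Action",
]

def score_by_type(records):
    """Compute accuracy per difference type.

    records: iterable of dicts with (hypothetical) keys
      'diff_type'  - one of DIFFERENCE_TYPES
      'prediction' - the VLM's answer to the paired-image question
      'answer'     - the ground-truth answer
    Returns {difference_type: accuracy} for types present in records.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        t = r["diff_type"]
        total[t] += 1
        correct[t] += int(r["prediction"] == r["answer"])
    return {t: correct[t] / total[t] for t in total}
```

Grouping scores this way, rather than reporting a single overall accuracy, is what lets a benchmark expose where reasoning "sharply deteriorates" (e.g. a model strong on Existence but weak on Quantity).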