UF | Mar 2, 2026 | arXiv:2603.01950

Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment

Christopher Driggers-Ellis, Nachiketh Tibrewal, Rohit Bogulla, Harsh Khanna, Sangpil Youm, Christan Grant, Bonnie J. Dorr

AI Summary

This paper benchmarks the performance of vision-language models (VLMs) on comic interpretation tasks, focusing on the page-level understanding needed to make comics accessible to blind or visually impaired users. The authors identify and categorize hallucinations produced by VLMs during comic interpretation, creating generalized object-hallucination taxonomies. The study reveals that semantic similarity metrics are a spurious measure of true comic understanding due to the prevalence of these hallucinations.

Key Contribution

Current VLMs struggle with page-level comic interpretation and frequently hallucinate objects, which shows that semantic similarity metrics are a poor proxy for true comic understanding.

Abstract

A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 20

Year: 2026

Venue: N/A
