This paper benchmarks vision-language models (VLMs) on comic interpretation tasks, focusing on the page-level understanding needed to make comics accessible to blind or visually impaired users. The authors identify and categorize the hallucinations VLMs produce during comic interpretation, organizing them into generalized object-hallucination taxonomies. The study finds that semantic similarity metrics are an unreliable measure of true comic understanding because these hallucinations are so prevalent.
Current VLMs struggle with page-level comic interpretation and frequently hallucinate objects, which makes semantic similarity metrics a poor proxy for true comic understanding.
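To make the point about similarity metrics concrete, here is a minimal toy sketch of how a description containing a hallucinated object can outscore a faithful paraphrase. It is not the paper's evaluation: a bag-of-words cosine similarity stands in for whatever semantic similarity metric the authors used, and the three sentences are invented purely for illustration.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two strings.

    Assumption: this is a simple stand-in for a semantic similarity metric,
    not the metric used in the paper.
    """
    va = Counter(re.findall(r"[a-z']+", a.lower()))
    vb = Counter(re.findall(r"[a-z']+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical sentences, invented for this example only.
reference = "a boy hands a letter to a girl on the train platform"
# Same wording, but the key object is wrong (a hallucinated "sword").
hallucinated = "a boy hands a sword to a girl on the train platform"
# Faithful to the scene, but worded differently.
paraphrase = "at the station a young man gives a note to a young woman"

score_hallucinated = bow_cosine(reference, hallucinated)
score_paraphrase = bow_cosine(reference, paraphrase)

print(f"hallucinated: {score_hallucinated:.3f}")
print(f"paraphrase:   {score_paraphrase:.3f}")
```

In this toy setup the hallucinated description scores higher than the faithful paraphrase, because a single wrong object changes very little of the surface text. Embedding-based metrics behave differently in detail, so this only illustrates the general failure mode the summary describes.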
A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.