Department of Computer Engineering, Department of Mathematical Sciences, Sharif University of Technology
Apr 2, 2026
arXiv:2604.01764

Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning

Seyed Amir Kasaei, Arash Marioriyad, Mahbod Khaleti, Mohammadamin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

AI Summary

The paper introduces RebusBench, a benchmark for evaluating the cognitive visual reasoning of Large Vision-Language Models (LVLMs) on rebus puzzles. Solving a rebus puzzle requires a model to extract visual and textual attributes, retrieve linguistic prior knowledge, and perform abstract mapping to synthesize a meaning beyond the pixel space. Experiments on state-of-the-art LVLMs reveal a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, suggesting that the models lack cognitive reasoning even though they possess the necessary visual and linguistic components.

Key Contribution

Despite recent advances, vision-language models still fail at rebus puzzles, exposing a critical gap in cognitive visual reasoning that neither model scaling nor in-context learning significantly narrowed in the reported experiments.

Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
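For readers unfamiliar with the Exact Match metric mentioned in the abstract, the following is a minimal sketch of how such a score is commonly computed. The normalization steps (lowercasing, removing ASCII punctuation, collapsing whitespace) and the function names `normalize` and `exact_match` are assumptions made for illustration; the abstract does not specify the scoring protocol RebusBench uses, and the semantic accuracy metric is not sketched here because its definition is not given.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace.

    These normalization choices are assumptions, not the paper's protocol.
    """
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Under this sketch, a prediction counts only if it matches the reference phrase in full after normalization, which is why a metric of this kind is strict for idiom-style answers where a partially correct guess earns no credit.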

Eval Frameworks & Benchmarks, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
