Apr 17, 2026 | arXiv:2604.16054

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N. Balasubramanian, Tanuja Ganu

AI Summary

The paper introduces "Mind's Eye," a benchmark for evaluating visuo-cognitive and visuospatial reasoning in MLLMs. Its tasks are inspired by human intelligence tests and organized under an Abstraction-Relation-Transformation (A-R-T) taxonomy. Across a range of evaluated MLLMs, even the top-performing models remain below 50% accuracy, compared with 80% for human participants. Error analysis points to limitations in visual attention, perceptual manipulation, and abstract concept understanding, which the authors take as evidence that more cognitively grounded evaluation frameworks are needed.

Key Contribution

MLLMs still struggle with core visuospatial reasoning skills like abstraction and transformation, lagging far behind human performance on a new cognitive benchmark.

Abstract

Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with that of human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.

Eval Frameworks & Benchmarks | Multimodal Models | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 64

Year: 2026

Venue: N/A
