This paper introduces a benchmark to evaluate MLLMs' ability to process discrete symbols across language, culture, mathematics, physics, and chemistry. The study reveals that MLLMs often fail at basic symbol recognition while succeeding in complex reasoning, indicating a reliance on linguistic priors rather than genuine visual understanding. This "cognitive mismatch" highlights a critical gap in MLLMs' ability to truly perceive and understand symbolic languages.
MLLMs can ace the test, but still fail to *see*: they often succeed at complex reasoning with symbols while failing at basic symbol recognition, revealing a reliance on linguistic priors over true visual perception.
While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols -- the fundamental building blocks of human cognition -- remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters demand interpretation that is both precise and deep. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic priors rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.
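The "cognitive mismatch" the abstract describes can be made concrete as a paired-probe evaluation: for each symbol, pose a low-level recognition question and a high-level reasoning question, then count items where reasoning succeeds even though recognition fails. The sketch below is illustrative only, not the paper's released harness; `SymbolProbe`, `query_model`, and the exact-match scoring are all assumptions.

```python
# A minimal sketch (not the authors' code) of a paired-probe evaluation
# for the recognition-vs-reasoning mismatch described in the paper.
from dataclasses import dataclass

@dataclass
class SymbolProbe:
    image_path: str     # rendered symbol: formula, chemical structure, character
    recognition_q: str  # low-level question, e.g. "Which symbol is shown?"
    recognition_a: str
    reasoning_q: str    # multi-step question that uses the same symbol
    reasoning_a: str

def query_model(image_path: str, question: str) -> str:
    """Hypothetical MLLM call; replace with your provider's vision-chat API."""
    raise NotImplementedError

def mismatch_rate(probes: list[SymbolProbe]) -> float:
    """Fraction of items where reasoning succeeds but recognition fails --
    the 'cognitive mismatch' pattern."""
    if not probes:
        return 0.0
    mismatched = 0
    for p in probes:
        rec_ok = query_model(p.image_path, p.recognition_q).strip() == p.recognition_a
        rea_ok = query_model(p.image_path, p.reasoning_q).strip() == p.reasoning_a
        if rea_ok and not rec_ok:
            mismatched += 1
    return mismatched / len(probes)
```

Exact string matching is deliberately naive here; a real harness would normalize answers (or use an LLM judge) and report per-domain rates across the five subject areas rather than a single aggregate.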