ETH | May 28, 2026 | arXiv:2605.30170

Unveiling the Visual Counting Bottleneck in Vision-Language Models

Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan

AI Summary

This paper investigates why VLMs struggle with visual counting extrapolation, breaking down the task into visual individuation, magnitude awareness, and symbolic mapping. Through linear probing on synthetic Go boards, the authors find that VLMs maintain robust magnitude representations and comparative reasoning abilities even in extrapolation regimes. The core failure lies in the symbolic mapping stage, where models struggle to project visual magnitudes onto symbolic tokens for unseen quantities.

Key Contribution

VLMs don't lack a visual understanding of quantity; they fail to connect what they see to symbolic number representations, which points to a fractured magnitude space.

Abstract

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on a state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.
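The probing protocol mentioned in the abstract (fit a linear probe on representations of seen counts, then evaluate it on larger, unseen counts) can be illustrated with a toy sketch. This is not the authors' setup: the "embeddings" below are synthetic vectors that encode the count linearly along a fixed direction by construction, so the outcome says nothing about real VLMs. The names `DIRECTION`, `NOISE`, `make_features`, and `fit_linear_probe`, and the count ranges 1 to 10 and 11 to 20, are all assumptions made for illustration.

```python
import random

# Hypothetical stand-ins: a fixed direction along which count is encoded, plus noise.
DIRECTION = [0.5, -0.3, 0.8, 0.1]
NOISE = 0.05

def make_features(count, direction, noise, rng):
    """Synthetic 'embedding': count scaled along `direction` with Gaussian noise."""
    return [count * d + rng.gauss(0.0, noise) for d in direction]

def build(counts, per_count, rng):
    """Build (features, labels) for the given counts."""
    xs, ys = [], []
    for c in counts:
        for _ in range(per_count):
            xs.append(make_features(c, DIRECTION, NOISE, rng))
            ys.append(float(c))
    return xs, ys

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear_probe(xs, ys, ridge=1e-6):
    """Least-squares probe (weights plus bias) via ridge-regularized normal equations."""
    rows = [x + [1.0] for x in xs]  # append a constant feature for the bias
    d = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(d)]
    return solve(ata, aty)

def predict(w, x):
    """Apply the probe to one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x + [1.0]))

def mean_abs_error(w, xs, ys):
    """Mean absolute error of the probe's predicted count."""
    return sum(abs(predict(w, x) - y) for x, y in zip(xs, ys)) / len(ys)

rng = random.Random(0)
train_x, train_y = build(range(1, 11), 20, rng)    # "seen" counts
extra_x, extra_y = build(range(11, 21), 20, rng)   # held-out larger counts

probe = fit_linear_probe(train_x, train_y)
mae_train = mean_abs_error(probe, train_x, train_y)
mae_extrap = mean_abs_error(probe, extra_x, extra_y)
print(f"MAE on seen counts: {mae_train:.3f}, on held-out counts: {mae_extrap:.3f}")
```

In a real study the feature vectors would come from a model's hidden states rather than from `make_features`, and a low held-out error would be the evidence that quantity remains linearly decodable beyond the training range.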

Computer Vision | Interpretability & Mechanistic Interp | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
