MBZUAI · Apr 11, 2026 · arXiv:2604.10039

Counting to Four is still a Chore for VLMs

Duy Le Dinh Anh, Patrick Amadeus Irawan, T. Vo

AI Summary

This paper introduces COUNTINGTRICKS, a diagnostic dataset of shape-based counting tasks, to analyze counting failures in VLMs. Through attention analysis and component-wise probing, the authors find that count-relevant visual information is strong in early modality projection layers but weakens in later language layers, making models vulnerable to text priors. They then propose Modality Attention Share (MAS), an intervention that encourages visual attention during answer generation, and show that it improves counting accuracy by mitigating the underuse of visual evidence.

Key Contribution

VLMs struggle with basic counting not because they can't "see" the objects, but because they forget to look when generating the answer.

Abstract

Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.
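To make the idea of a "minimum budget of visual attention" concrete, the sketch below shows one possible reading of it: rescaling a single post-softmax attention row at inference time so that visual tokens receive at least a chosen share of the attention mass. This is an illustrative assumption, not the authors' implementation. The abstract does not say whether MAS is applied as a training objective or as an inference-time adjustment, and the function name `enforce_visual_share`, the `min_share` parameter, and its default value are all hypothetical.

```python
from typing import List, Sequence


def enforce_visual_share(
    attn: Sequence[float],
    is_visual: Sequence[bool],
    min_share: float = 0.3,  # hypothetical default, not taken from the paper
) -> List[float]:
    """Rescale one attention row so visual tokens hold at least `min_share`
    of the total attention mass. Illustrative sketch only."""
    if len(attn) != len(is_visual):
        raise ValueError("attn and is_visual must have the same length")
    if not 0.0 <= min_share < 1.0:
        raise ValueError("min_share must be in [0, 1)")

    visual_mass = sum(a for a, v in zip(attn, is_visual) if v)
    text_mass = sum(a for a, v in zip(attn, is_visual) if not v)
    total = visual_mass + text_mass

    # Leave the row untouched if there is nothing to rescale
    # or if the visual budget is already met.
    if total == 0.0 or visual_mass == 0.0 or text_mass == 0.0:
        return list(attn)
    if visual_mass / total >= min_share:
        return list(attn)

    # Scale visual weights up to the budget and text weights down,
    # preserving the total mass and the relative weights within each group.
    visual_scale = min_share * total / visual_mass
    text_scale = (1.0 - min_share) * total / text_mass
    return [a * (visual_scale if v else text_scale) for a, v in zip(attn, is_visual)]


# Usage: two visual tokens followed by two text tokens.
row = [0.05, 0.05, 0.6, 0.3]
mask = [True, True, False, False]
adjusted = enforce_visual_share(row, mask, min_share=0.3)
```

In this sketch the rescaling preserves the relative weighting within the visual group and within the text group, so only the balance between modalities changes. Rows that already meet the budget are returned unchanged.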

Eval Frameworks & Benchmarks · Multimodal Models · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 21

Year: 2026

Venue: N/A
