Mar 9, 2026 · arXiv:2603.08436

Can Vision-Language Models Solve the Shell Game?

AI Summary

The authors introduce VET-Bench, a synthetic diagnostic benchmark designed to evaluate the ability of Vision-Language Models (VLMs) to track visually identical objects through spatiotemporal continuity, revealing that current VLMs perform near chance level due to an over-reliance on static frame-level features. They theoretically prove that fixed-depth transformer-based VLMs are fundamentally limited in this task without intermediate supervision. To overcome this limitation, they propose Spatiotemporal Grounded Chain-of-Thought (SGCoT), which generates object trajectories as explicit intermediate states, achieving state-of-the-art accuracy on VET-Bench after fine-tuning on synthesized text-only data.

Key Contribution

VLMs can't play the shell game: they fail to track visually identical objects over time, revealing a surprising reliance on static features and a fundamental limitation in maintaining entity representations.

Abstract

Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io.
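The state-tracking view of the shell game can be illustrated with a toy simulation. The sketch below is not the authors' benchmark or SGCoT implementation; the function names `track_ball` and `guess_accuracy`, the cup indexing, and the swap encoding as index pairs are all assumptions made for illustration. It shows two things the abstract describes: the answer is only recoverable by updating the ball's position swap by swap (recording each intermediate state as a trajectory), and a guesser that ignores the swaps lands at chance, roughly 1 / number of cups.

```python
import random


def track_ball(start, swaps):
    """Follow the ball through a sequence of cup swaps.

    `start` is the cup index hiding the ball; `swaps` is a list of (a, b)
    index pairs (a hypothetical encoding). Returns the ball's position after
    every swap, i.e. the explicit intermediate states.
    """
    position = start
    trajectory = [position]
    for a, b in swaps:
        if position == a:
            position = b
        elif position == b:
            position = a
        trajectory.append(position)
    return trajectory


def guess_accuracy(num_cups, num_swaps, trials, rng):
    """Estimate the accuracy of a guesser that ignores the swaps entirely."""
    correct = 0
    for _ in range(trials):
        start = rng.randrange(num_cups)
        swaps = [tuple(rng.sample(range(num_cups), 2)) for _ in range(num_swaps)]
        final = track_ball(start, swaps)[-1]
        if rng.randrange(num_cups) == final:  # uniform guess, no tracking
            correct += 1
    return correct / trials


if __name__ == "__main__":
    # Ball starts under cup 0 and three swaps follow.
    print(track_ball(0, [(0, 1), (1, 2), (0, 2)]))  # [0, 1, 2, 0]
    # A non-tracking guesser on 3 cups should hover around 1/3.
    print(guess_accuracy(num_cups=3, num_swaps=5, trials=20000, rng=random.Random(0)))
```

Because the cups are identical, no single frame reveals the answer; only the running update inside the loop does, which is the kind of intermediate state that a trajectory-style chain of thought writes out explicitly.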

Computer Vision · Eval Frameworks & Benchmarks · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 78

Year: 2026

Venue: N/A
