The paper introduces a tool-guided inference framework for VLMs to address their failure to correctly interpret visual illusions. The framework provides the VLM with a suite of generic image manipulation tools (line drawing, cropping, comparison, channel isolation) and a routing system that guides tool selection based on illusion type. This approach lets the VLM reason about illusions by generating and referencing intermediate annotated views, yielding improved generalization across different illusion structures without any model training.
Giving VLMs access to basic image manipulation tools and a strategic routing system dramatically improves their ability to "see through" visual illusions, even generalizing to unseen illusion types.
Vision-language models (VLMs) exhibit a systematic bias when confronted with classic optical illusions: they overwhelmingly predict the illusion as "real" regardless of whether the image has been counterfactually modified. We present a tool-guided inference framework for the DataCV 2026 Challenge (Tasks I and II) that addresses this failure mode without any model training. An off-the-shelf vision-language model is given access to a small set of generic image manipulation tools (line drawing, region cropping, side-by-side comparison, and channel isolation) together with an illusion-type routing system prompt that prescribes which tools to invoke for each perceptual question category. Critically, every tool call produces a new, immutable image resource appended to a persistent registry, so the model can reference and compose any prior annotated view throughout its reasoning chain. Rather than hard-coding illusion-specific modules, this generic-tool-plus-routing design yields strong cross-structural generalization: performance remained consistent from the validation set to a test set containing structurally unfamiliar illusion variants (e.g., Mach Bands rotated from vertical to horizontal stacking). We further report three empirical observations that we believe warrant additional investigation: (i) a strong positive-detection bias likely rooted in imbalanced illusion training data, (ii) a striking dissociation between pixel-accurate spatial reasoning and logical inference over self-generated annotations, and (iii) pronounced sensitivity to image compression artifacts that compounds false positives.
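The append-only resource registry described in the abstract (each tool call yields a new, immutable image that later reasoning steps can reference) might be sketched as follows; all names here are hypothetical illustrations, not the paper's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class ImageResource:
    """Immutable record of one tool output, referenced by id in later reasoning."""
    resource_id: str
    tool: str                       # e.g. "crop", "line_draw", "channel_isolate"
    parent_id: Optional[str]        # which prior view this was derived from
    path: str                       # where the rendered image is stored


@dataclass
class ToolRegistry:
    """Append-only registry: every tool call adds a resource; nothing is mutated."""
    resources: list = field(default_factory=list)

    def register(self, tool: str, parent_id: Optional[str],
                 render: Callable[[], str]) -> ImageResource:
        # The render callable stands in for the actual image manipulation;
        # it returns the path of the newly produced image.
        res = ImageResource(
            resource_id=f"img_{len(self.resources)}",
            tool=tool,
            parent_id=parent_id,
            path=render(),
        )
        self.resources.append(res)
        return res

    def get(self, resource_id: str) -> ImageResource:
        return next(r for r in self.resources if r.resource_id == resource_id)
```

A typical chain would register the original image first, then derive annotated views from it (a crop, then a line drawing over the crop), so the model's reasoning can cite any intermediate view by its id.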