VisDoT enhances visual reasoning in LVLMs by explicitly grounding them in human-like perceptual tasks, such as position and length estimation, based on the theory of graphical perception. The authors introduce Decomposition-of-Thought (DoT) prompting, which separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves state-of-the-art results on ChartQA and ChartQAPro, and shows zero-shot gains on open-domain VQA benchmarks, demonstrating the generalizability of the approach.
LVLMs can achieve state-of-the-art chart understanding by mimicking human perceptual processes and decomposing reasoning into perception and logic sub-problems.
Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length estimation. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.
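The perception-logic separation behind DoT prompting can be sketched as follows. This is a minimal illustration only: the function name, prompt wording, and sub-question structure are hypothetical and not taken from the paper, which does not publish its exact prompts in this summary.

```python
# Hypothetical sketch of Decomposition-of-Thought (DoT) style prompting:
# visual perception sub-questions are posed first, then a logic
# sub-question that combines their answers. Names and prompt text
# are illustrative assumptions, not the authors' actual templates.

def build_dot_prompt(question: str, perception_subqs: list[str]) -> str:
    """Assemble a two-stage prompt: perception first, logic second."""
    lines = ["Answer each visual perception sub-question from the chart:"]
    for i, sq in enumerate(perception_subqs, 1):
        lines.append(f"P{i}. {sq}")
    lines.append("Then, using those answers, solve the logic sub-question:")
    lines.append(f"L1. {question}")
    return "\n".join(lines)


prompt = build_dot_prompt(
    "Which country had the largest increase between 2010 and 2020?",
    [
        "What is the bar value for each country in 2010?",
        "What is the bar value for each country in 2020?",
    ],
)
print(prompt)
```

The key design point is sequencing: the model must commit to perceptual readings (positions, lengths, values) before any arithmetic or comparison, so logic errors can be traced back to either a perception or a reasoning failure.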