Apr 23, 2026

arXiv:2604.21396

VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

B. Lim, Kyeonghyun Kim, Jung-Shin Yun, Youngbin Kim

AI Summary

The authors introduce VG-CoT, a new dataset for training and evaluating visual reasoning in LVLMs, which explicitly grounds each reasoning step to visual evidence within the image using an automated three-stage pipeline. This pipeline leverages object detection, OCR, and GPT-4o to generate step-by-step grounded reasoning and refine the grounding through rationale-driven open-set detection. Experiments using VG-CoT to train LLaVA-1.5 and Qwen2-VL show improvements in rationale quality, answer accuracy, and reasoning-answer alignment, demonstrating the dataset's effectiveness in enhancing trustworthy, evidence-based reasoning.

Key Contribution

Forget hand-annotated visual reasoning datasets: VG-CoT leverages a fully automated pipeline to generate grounded, step-by-step reasoning, enabling scalable and cost-efficient training of more trustworthy LVLMs.

Abstract

The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLM reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.
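The three-stage flow described in the abstract can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' code, which has not been released: the names `Evidence`, `ReasoningStep`, and `build_sample`, the box format, and the four callables passed in are all hypothetical. In particular, the abstract does not say how the refinement stage works, so the sketch assumes it means re-grounding steps that were left without evidence by querying an open-set detector with the step text.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels. Not specified by the source.
Box = Tuple[float, float, float, float]


@dataclass
class Evidence:
    label: str   # object class name or recognized text
    box: Box     # image region the label refers to
    source: str  # "detector", "ocr", or "open_set" (hypothetical tags)


@dataclass
class ReasoningStep:
    text: str                                               # one reasoning step
    evidence: List[Evidence] = field(default_factory=list)  # regions it cites


def build_sample(
    image: Any,
    question: str,
    detect: Callable[[Any], List[Evidence]],
    read_text: Callable[[Any], List[Evidence]],
    generate_steps: Callable[[str, List[Evidence]], List[ReasoningStep]],
    open_set_detect: Callable[[Any, str], List[Evidence]],
) -> List[ReasoningStep]:
    """Hypothetical driver for one image-question pair."""
    # Stage 1: collect object-level and text-level evidence.
    evidence = detect(image) + read_text(image)

    # Stage 2: ask a language model (GPT-4o in the paper) for grounded steps.
    steps = generate_steps(question, evidence)

    # Stage 3 (ASSUMPTION): steps with no cited region are re-grounded by
    # querying an open-set detector with the text of the step itself.
    for step in steps:
        if not step.evidence:
            step.evidence = open_set_detect(image, step.text)
    return steps
```

Passing the models in as callables keeps the sketch independent of any particular detector, OCR engine, or API client; each one can be swapped for a stub when testing the data flow.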

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
