Korea U · Apr 2, 2026 · arXiv:2604.01634

CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning

Junyoung Sung, Seungwoo Lyu, Minjun Kim, Sumin An, Arsha Nagrani, Paul Hongsuck Seo

AI Summary

The paper introduces CRIT, a new dataset and benchmark designed to evaluate and improve cross-modal multi-hop reasoning in VLMs by generating complex tasks across diverse domains using a graph-based automatic pipeline. CRIT addresses the limitations of existing multimodal benchmarks that often allow for single-modality inference, leading to hallucination and poor grounding in visual evidence. Training VLMs on CRIT results in significant improvements in cross-modal multi-hop reasoning, as demonstrated by strong performance gains on SPIQA and other standard multimodal benchmarks.

Key Contribution

VLMs still struggle to combine visual and textual information for multi-hop reasoning, but a new automatically generated dataset, CRIT, can help them learn.

Abstract

Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet most multimodal benchmarks fail to capture this ability: they typically rely on single images or sets of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT spans diverse domains, including natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.

Data Curation & Synthetic Data · Multimodal Models · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 79

Year: 2026

Venue: N/A
