The paper introduces Vision-Language Causal Graphs (VLCGs) to explicitly represent causally relevant objects, attributes, relations, and scene-grounded assumptions for visual question answering. It then presents ViLCaR, a diagnostic benchmark built on VLCGs that evaluates Causal Attribution, Causal Inference, and Question Answering in LVLMs using graph-aligned evaluation metrics. Experiments show that providing structured relevance information via VLCGs significantly improves attribution and inference consistency in state-of-the-art LVLMs compared to zero-shot and standard in-context learning, suggesting that structural guidance is key to improving causal reasoning.
LVLMs struggle with causal reasoning not because they lack the capacity, but because they lack structured guidance on what's relevant.
Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments on state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.
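To make the idea of a query-conditioned graph and a graph-aligned metric more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `VLCG` and `Relation` classes, their field names, and the `relevance_f1` function are hypothetical and are not taken from the paper, which does not spell out its schema or metric definitions in the abstract. The sketch simply stores the four element types the abstract names (objects, attributes, relations, assumptions) and scores a predicted graph against a reference graph by set overlap.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """Hypothetical directed relation between two scene objects."""
    subject: str
    predicate: str
    obj: str


@dataclass
class VLCG:
    """Hypothetical container for a query-conditioned causal graph.

    The field layout is an assumption; the paper may structure its graphs differently.
    """
    query: str
    objects: set = field(default_factory=set)      # object names
    attributes: set = field(default_factory=set)   # (object, attribute) pairs
    relations: set = field(default_factory=set)    # Relation instances
    assumptions: set = field(default_factory=set)  # scene-grounded assumptions

    def elements(self) -> set:
        """Flatten the graph into tagged tuples so two graphs can be compared."""
        items = {("object", o) for o in self.objects}
        items |= {("attribute", o, a) for o, a in self.attributes}
        items |= {("relation", r.subject, r.predicate, r.obj) for r in self.relations}
        items |= {("assumption", s) for s in self.assumptions}
        return items


def relevance_f1(predicted: VLCG, reference: VLCG) -> dict:
    """Illustrative overlap score (precision, recall, F1) over graph elements.

    This is a stand-in for a graph-aligned metric, not the paper's definition.
    """
    pred, ref = predicted.elements(), reference.elements()
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example data, for illustration only.
question = "Why is the road wet?"
reference = VLCG(
    query=question,
    objects={"road", "umbrella", "person"},
    attributes={("road", "wet")},
    relations={Relation("person", "holds", "umbrella")},
    assumptions={"it rained recently"},
)
predicted = VLCG(
    query=question,
    objects={"road", "car"},
    attributes={("road", "wet")},
)

scores = relevance_f1(predicted, reference)
print(scores)
```

In this toy case the predicted graph shares two of its three elements with the six-element reference, so it scores higher on precision than on recall. A score of this kind separates "picked the wrong evidence" from "reasoned badly over the right evidence", which is the distinction the abstract says answer accuracy alone cannot make.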