This paper introduces Dual Causal Inference (DCI), a novel framework for Medical Visual Question Answering (MedVQA) that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to address both observable and unobservable confounders. DCI formulates a Structural Causal Model (SCM) to mitigate cross-modal biases via BDA and compensate for unobserved confounders using an IV learned from a shared latent space, enforcing IV validity through mutual information constraints. Experiments on four MedVQA datasets demonstrate that DCI outperforms existing methods, particularly in out-of-distribution generalization, while also enhancing interpretability and robustness.
Medical VQA models can now reason more reliably thanks to a new framework that disentangles true causal effects from spurious correlations by jointly tackling observable and unobservable confounders.
Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers conditioned on complex medical images and questions. However, existing methods often overfit to superficial cross-modal correlations, neglecting the intrinsic biases embedded in multimodal medical data. Consequently, models become vulnerable to cross-modal confounding effects, severely hindering their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly tackle both observable and unobserved confounders. Specifically, we formulate a Structural Causal Model (SCM) where observable cross-modal biases (e.g., frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its associations with the unobserved confounders and target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, demonstrate that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Furthermore, qualitative analyses confirm that DCI significantly enhances the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.
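The backdoor-adjustment step described above, which marginalizes the answer distribution over observable confounder strata (e.g. frequent visual-textual co-occurrences), can be sketched for the discrete case. This is a minimal illustration, not the authors' implementation: the function name, the stratification of confounders, and the toy probabilities are all assumptions for the example.

```python
import numpy as np

def backdoor_adjust(p_answer_given_xz: np.ndarray, p_z: np.ndarray) -> np.ndarray:
    """Discrete backdoor adjustment: P(A | do(X)) = sum_z P(A | X, z) P(z).

    p_answer_given_xz: shape (num_strata, num_answers), P(A | X, z) for a
        fixed multimodal input X, one row per confounder stratum z
        (hypothetical strata, e.g. visual-textual co-occurrence patterns).
    p_z: shape (num_strata,), prior over the confounder strata.
    """
    # Weight each stratum's conditional answer distribution by the stratum
    # prior and sum out z, severing the backdoor path through the confounder.
    return p_z @ p_answer_given_xz

# Toy example: two confounder strata, three candidate answers.
p_a_xz = np.array([[0.7, 0.2, 0.1],   # P(A | X, z=0)
                   [0.1, 0.3, 0.6]])  # P(A | X, z=1)
p_z = np.array([0.5, 0.5])
p_do = backdoor_adjust(p_a_xz, p_z)  # deconfounded answer distribution
```

In the paper's setting the strata and conditionals would come from learned multimodal representations rather than a lookup table, and the unobserved confounders handled by the IV branch are, by definition, outside this adjustment.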