This paper introduces Dual Causal Inference (DCI), a novel framework for Medical Visual Question Answering (MedVQA) that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to address both observable and unobservable confounders. DCI formulates a Structural Causal Model (SCM) to mitigate cross-modal biases via BDA and compensate for unobserved confounders using an IV learned from a shared latent space, enforcing IV validity through mutual information constraints. Experiments on four MedVQA datasets demonstrate that DCI outperforms existing methods, particularly in out-of-distribution generalization, while also enhancing interpretability and robustness.
Medical VQA models can now reason more reliably thanks to a new framework that disentangles true causal effects from spurious correlations by jointly tackling observable and unobservable confounders.
Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers conditioned on complex medical images and questions. However, existing methods often overfit to superficial cross-modal correlations, neglecting the intrinsic biases embedded in multimodal medical data. Consequently, models become vulnerable to cross-modal confounding effects, severely hindering their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly tackle both observable and unobserved confounders. Specifically, we formulate a Structural Causal Model (SCM) where observable cross-modal biases (e.g., frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its associations with the unobserved confounders and target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, demonstrate that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Furthermore, qualitative analyses confirm that DCI significantly enhances the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.
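The backdoor-adjustment step described above, which marginalizes the answer distribution over observable confounder strata (e.g. frequent visual-textual co-occurrences), can be sketched for the discrete case. This is a minimal illustration, not the authors' implementation: the function name, the stratification of confounders, and the toy probabilities are all assumptions for the example.

```python
import numpy as np

def backdoor_adjust(p_answer_given_xz: np.ndarray, p_z: np.ndarray) -> np.ndarray:
    """Discrete backdoor adjustment: P(A | do(X)) = sum_z P(A | X, z) P(z).

    p_answer_given_xz: shape (num_strata, num_answers), P(A | X, z) for a
        fixed multimodal input X, one row per confounder stratum z
        (hypothetical strata, e.g. visual-textual co-occurrence patterns).
    p_z: shape (num_strata,), prior over the confounder strata.
    """
    # Weight each stratum's conditional answer distribution by the stratum
    # prior and sum out z, severing the backdoor path through the confounder.
    return p_z @ p_answer_given_xz

# Toy example: two confounder strata, three candidate answers.
p_a_xz = np.array([[0.7, 0.2, 0.1],   # P(A | X, z=0)
                   [0.1, 0.3, 0.6]])  # P(A | X, z=1)
p_z = np.array([0.5, 0.5])
p_do = backdoor_adjust(p_a_xz, p_z)  # deconfounded answer distribution
```

In the paper's setting the strata and conditionals would come from learned multimodal representations rather than a lookup table, and the unobserved confounders handled by the IV branch are, by definition, outside this adjustment.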