May 6, 2026 · arXiv:2605.04838

PAIR-CI: Calibrated Conditional Independence Testing for Causal Discovery with Incomplete Data

AI Summary

The paper addresses miscalibration in constraint-based causal discovery with incomplete data, where imputation error induces spurious conditional dependence and inflates the false positive rate of conditional independence (CI) tests. The authors introduce PAIR-CI, a nonparametric CI test that integrates multiple imputation directly into the testing procedure through a paired permutation design and cross-validation. Because the compared models receive the same imputed conditioning set, imputation error cancels in their loss difference, and a provably consistent variance estimator accounts for uncertainty from both cross-validation and multiple imputation.

Key Contribution

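Impute-then-test causal discovery can be badly miscalibrated when data are missing: in the paper's simulations, existing imputation-based CI tests show false positive rates of 28--45% when data are missing not at random (MNAR). PAIR-CI averages below the nominal 5% level, and when integrated into the PC algorithm it reduces structural Hamming distance by 8% to 44%, with larger gains on larger graphs.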

Abstract

The standard constraint-based paradigm for causal discovery with incomplete data -- impute first, test second -- is frequently miscalibrated: any consistent conditional independence (CI) test rejects a true null with probability approaching 1 when imputation error induces spurious conditional dependence. We introduce PAIR-CI, a nonparametric CI test that restores calibration by integrating multiple imputation directly into the inferential procedure via a paired permutation design. PAIR-CI compares cross-validated models that include and exclude the candidate variable while receiving the same imputed conditioning set, forcing imputation error to cancel in their loss difference rather than contaminate the test statistic. A provably consistent variance estimator jointly accounts for uncertainty arising from cross-validation and multiple imputation -- to our knowledge, the first formal unification of these two inferential frameworks. In simulations, existing imputation-based CI tests exhibit false positive rates of 28--45% when data are missing not at random (MNAR), whereas PAIR-CI averages below the nominal 5% level across data-generating processes and missingness mechanisms. These gains are largest in nonlinear settings and grow with causal graph size: when integrated into the PC algorithm, PAIR-CI reduces structural Hamming distance by 8% on 10-variable nonlinear graphs, 15% on 30-variable equivalents, and up to 44% on the 56-variable HAILFINDER network, with stable performance in all settings.
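The pairing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and makes no claim of calibration: it only shows the structure in which a reduced model (conditioning set only) and a full model (conditioning set plus candidate variable) are cross-validated on the same folds and the same imputed conditioning set, so that their per-sample losses are differenced pairwise. Everything named here is an assumption made for illustration: the function names `impute_draws`, `cv_loss_difference`, and `paired_ci_pvalue`, the crude mean-plus-noise imputer, the linear least-squares models (the paper's test is nonparametric), and the sign-flip permutation test, which stands in for the paper's variance estimator.

```python
import numpy as np

def impute_draws(z, m, rng):
    """Hypothetical stand-in imputer: fill NaNs with column mean plus Gaussian noise."""
    mu, sd = np.nanmean(z, axis=0), np.nanstd(z, axis=0)
    draws = []
    for _ in range(m):
        fill = mu + sd * rng.standard_normal(z.shape)
        draws.append(np.where(np.isnan(z), fill, z))
    return draws

def cv_loss_difference(x, y, z, folds, n_folds):
    """Out-of-fold squared-error loss of the reduced model minus that of the full model."""
    n = len(y)
    ones = np.ones((n, 1))
    reduced = np.hstack([ones, z])                   # Z only
    full = np.hstack([ones, z, x.reshape(-1, 1)])    # Z plus candidate X
    diff = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        losses = []
        for design in (reduced, full):
            coef, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
            losses.append((y[test] - design[test] @ coef) ** 2)
        diff[test] = losses[0] - losses[1]           # positive when X helps
    return diff

def paired_ci_pvalue(x, y, z_missing, m=5, n_folds=5, n_perm=999, seed=0):
    """One-sided sign-flip p-value for 'X improves prediction of Y given Z'."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % n_folds             # same folds for every imputation
    # Both models see the same imputed Z in each draw; average differences over draws.
    d = np.mean(
        [cv_loss_difference(x, y, z, folds, n_folds)
         for z in impute_draws(z_missing, m, rng)],
        axis=0,
    )
    observed = d.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, n))
    null = (signs * d).mean(axis=1)                  # sign-flipped paired differences
    return (1 + np.sum(null >= observed)) / (n_perm + 1)
```

Calling `paired_ci_pvalue(x, y, z_missing)` with a 1-D candidate `x`, a 1-D target `y`, and a 2-D conditioning matrix `z_missing` containing `NaN` entries returns a p-value; a small value suggests that `x` carries predictive information about `y` beyond the imputed conditioning set. Reusing the same folds and the same imputed draws for both models is the design choice that makes the loss difference a paired quantity.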

Data Curation & Synthetic Data · Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
