This paper investigates the relationship between classification calibration and predictive multiplicity, where multiple near-optimal models produce conflicting predictions. The study uses nine credit risk datasets to show that predictive multiplicity disproportionately affects minority classes and concentrates in regions of low predictive confidence. The authors demonstrate that post-hoc calibration methods, particularly Platt Scaling and Isotonic Regression, can effectively reduce predictive multiplicity across the Rashomon set.
Post-hoc calibration isn't just about probabilities; it can also wrangle conflicting predictions from near-optimal models, especially for minority groups.
As machine learning models are increasingly deployed in high-stakes environments, ensuring both probabilistic reliability and prediction stability has become critical. This paper examines the interplay between classification calibration and predictive multiplicity, the phenomenon in which multiple near-optimal models within the Rashomon set yield conflicting credit outcomes for the same applicant. Using nine diverse credit risk benchmark datasets, we investigate whether predictive multiplicity concentrates in regions of low predictive confidence and how post-hoc calibration can mitigate algorithmic arbitrariness. Our empirical analysis reveals that minority class observations bear a disproportionate multiplicity burden, as confirmed by significant disparities in predictive multiplicity and prediction confidence. Furthermore, our empirical comparisons indicate that applying post-hoc calibration methods (specifically Platt Scaling, Isotonic Regression, and Temperature Scaling) is associated with lower obscurity across the Rashomon set. Among the tested techniques, Platt Scaling and Isotonic Regression provide the most robust reduction in predictive multiplicity. These findings suggest that calibration can function as a consensus-enforcing layer and may support procedural fairness by mitigating predictive multiplicity.
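To make the setup concrete, here is a minimal sketch of the kind of experiment the abstract describes: build a toy "Rashomon set" of near-optimal classifiers, measure how often they disagree on test points (a simple ambiguity metric), then apply Platt Scaling (a sigmoid fit on held-out scores) to each model and re-measure disagreement. The dataset, bootstrap construction of the Rashomon set, and `ambiguity` function are illustrative assumptions, not the paper's actual protocol.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative imbalanced data (stand-in for a credit risk benchmark).
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

def ambiguity(label_matrix):
    """Fraction of test points on which at least two models disagree."""
    L = np.asarray(label_matrix)
    return float(np.mean(L.min(axis=0) != L.max(axis=0)))

# Toy Rashomon set: bootstrap-resampled models with near-equal accuracy.
rng = np.random.default_rng(0)
models = []
for _ in range(10):
    idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)
    models.append(LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx]))

raw = [m.predict(X_te) for m in models]

# Platt Scaling: fit a sigmoid (logistic regression) on each model's raw
# decision scores using the held-out calibration split, then threshold
# the calibrated probabilities at 0.5.
cal = []
for m in models:
    s_cal = m.decision_function(X_cal).reshape(-1, 1)
    platt = LogisticRegression().fit(s_cal, y_cal)
    s_te = m.decision_function(X_te).reshape(-1, 1)
    cal.append((platt.predict_proba(s_te)[:, 1] >= 0.5).astype(int))

print("ambiguity before calibration:", ambiguity(raw))
print("ambiguity after calibration: ", ambiguity(cal))
```

Isotonic Regression would slot in the same way (replace the sigmoid with `sklearn.isotonic.IsotonicRegression` on the scores); whether calibration actually shrinks ambiguity on a given dataset is an empirical question, which is what the paper tests across its nine benchmarks.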