This paper introduces Fused Reference Alignment Prediction (FRAP), a method for performance estimation under distribution shift that combines the strengths of an external foundation model with those of the base model. FRAP uses temperature-scaled calibration to align the two models' prediction distributions and confidence-based weighting to fuse them, producing a more reliable surrogate for ground-truth labels. Experiments show that FRAP substantially outperforms existing performance-estimation methods across a range of datasets and architectures, which suggests that it mitigates the model biases that distribution shift amplifies.
FRAP achieves substantial improvements in performance estimation under distribution shifts by effectively merging the strengths of foundation and base models.
Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from that of the training data, a scenario that requires indicators that faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model, whose biases are amplified once the distribution shifts, weakening the correlation with true performance. Motivated by this limitation, we propose Fused Reference Alignment Prediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate for the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are then fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model. The performance estimate is obtained by measuring how closely the base model's predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.
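The three steps described in the abstract (calibrate, fuse, measure agreement) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the divergence, the temperature search, the weighting rule, or the agreement measure, so the choices here are assumptions made purely for illustration. The sketch uses KL divergence, a grid search over temperatures, fusion weights proportional to each model's maximum class probability, and top-1 agreement as the accuracy estimate. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one logit vector."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def fit_temperature(foundation_logits, base_probs, grid=None):
    """Pick the temperature that minimizes the mean KL(base || foundation).

    Assumption: a simple grid search stands in for whatever optimizer
    the paper actually uses.
    """
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0

    def mean_kl(t):
        pairs = zip(foundation_logits, base_probs)
        return sum(kl_divergence(b, softmax(f, t)) for f, b in pairs) / len(base_probs)

    return min(grid, key=mean_kl)


def fuse(base_p, found_p):
    """Confidence-weighted fusion of two distributions.

    Assumption: each model's weight is proportional to its max probability.
    """
    conf_base, conf_found = max(base_p), max(found_p)
    w_base = conf_base / (conf_base + conf_found)
    return [w_base * b + (1.0 - w_base) * f for b, f in zip(base_p, found_p)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def estimate_accuracy(base_logits, foundation_logits):
    """Estimate base-model accuracy as top-1 agreement with the fused reference."""
    base_probs = [softmax(z) for z in base_logits]
    t = fit_temperature(foundation_logits, base_probs)
    agree = 0
    for b, f in zip(base_probs, foundation_logits):
        reference = fuse(b, softmax(f, t))
        agree += int(argmax(b) == argmax(reference))
    return agree / len(base_probs), t


if __name__ == "__main__":
    # Toy logits for three unlabeled samples and three classes (made-up values).
    base_logits = [[2.0, 0.0, 0.0], [0.2, 0.0, 0.1], [0.0, 3.0, 0.0]]
    foundation_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.0, 6.0, 0.0]]
    estimate, temperature = estimate_accuracy(base_logits, foundation_logits)
    print(f"estimated accuracy={estimate:.3f}, temperature={temperature}")
```

In this toy input the base model is nearly uniform on the second sample while the foundation model is confident in a different class, so the fused reference sides with the foundation model and that sample counts as a disagreement. A softer agreement measure, such as a divergence between the base prediction and the reference, would be an equally plausible reading of the abstract.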