DAMO; Key Laboratory of Machine Intelligence and Advanced Computing; Shenzhen Loop Area Institute; SYSU
Jun 4, 2026
arXiv:2606.06335

Bridging Domain Expertise and Generalization for Performance Estimation

Shuxuan Li, Zhilin Zhao, Quyu Kong, Wei-Shi Zheng

AI Summary

This paper introduces Fused Reference Alignment Prediction (FRAP), a method for performance estimation under distribution shift that combines an external foundation model with the base model being evaluated. FRAP uses temperature-scaled calibration to align the foundation model's prediction distribution with that of the base model, then fuses the two through confidence-based weighting into a more reliable surrogate for ground-truth labels. Experiments across diverse datasets and architectures show consistent and substantial improvements over representative performance-estimation methods, which the paper attributes to reduced reliance on the base model's own outputs, whose biases are amplified under shift.

Key Contribution

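FRAP improves performance estimation under distribution shift by fusing calibrated foundation-model predictions with base-model predictions into a reference distribution that stands in for ground-truth labels, and by scoring the base model on how closely it agrees with that reference.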

Abstract

Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model, whose biases are amplified once the distribution shifts, weakening the correlation with the true performance. Motivated by this limitation, we propose Fused Reference Alignment Prediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate for the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model, and the performance estimate is obtained by measuring how closely the base model's predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.
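The three steps named in the abstract (align, fuse, measure agreement) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the divergence, the weighting rule, or the agreement measure, so the sketch assumes KL divergence fitted by a grid search over temperatures, weights proportional to each model's maximum softmax probability, and agreement measured as the fraction of matching argmax labels. The function names (`fit_temperature`, `frap_style_estimate`) are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax with temperature scaling."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_kl(p, q, eps=1e-12):
    """Mean KL(p || q) over rows. KL is an assumed choice of divergence."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def fit_temperature(found_logits, base_probs, grid=None):
    """Pick the temperature for the foundation model that minimizes its
    divergence from the base model (assumption: simple grid search)."""
    if grid is None:
        grid = np.logspace(-1, 1, 50)  # temperatures from 0.1 to 10
    losses = [mean_kl(base_probs, softmax(found_logits, t)) for t in grid]
    return float(grid[int(np.argmin(losses))])

def frap_style_estimate(base_logits, found_logits):
    """Return (estimated accuracy, fitted temperature) without using labels."""
    base_probs = softmax(base_logits)

    # Step 1: align the foundation model's predictions with the base model's.
    t = fit_temperature(found_logits, base_probs)
    found_probs = softmax(found_logits, t)

    # Step 2: confidence-based fusion (assumption: weights proportional to
    # each model's maximum class probability on that sample).
    conf_base = base_probs.max(axis=1, keepdims=True)
    conf_found = found_probs.max(axis=1, keepdims=True)
    w = conf_base / (conf_base + conf_found)
    reference = w * base_probs + (1.0 - w) * found_probs

    # Step 3: agreement between base predictions and the fused reference
    # (assumption: fraction of samples with matching argmax labels).
    agreement = np.mean(base_probs.argmax(axis=1) == reference.argmax(axis=1))
    return float(agreement), t
```

Calling `frap_style_estimate(base_logits, found_logits)` on two arrays of shape `(n_samples, n_classes)` returns a label-free agreement score in [0, 1] together with the fitted temperature. Both models must share the same class ordering for the fusion to be meaningful.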

Eval Frameworks & Benchmarks; Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
