University of Artificial Intelligence | May 6, 2026 | arXiv:2605.04895

Regime-Conditioned Evaluation in Multi-Context Bayesian Optimization

AI Summary

This paper audits 40 transfer Bayesian Optimization (BO) papers and finds that 98% never vary the budget ratio (B/|A|) as a controlled axis, so their reported comparisons are average treatment effects (ATEs) over hidden regime variables and can be unstable or misleading. The authors introduce the Portable Regime Score, PRS = (B/|A|)(1 - rho), to capture the conditional effect of acquisition choice as a function of the budget ratio and the prior rank correlation rho. They show that PRS predicts the winner in five published reversal cases and propose RegimePlanner, an algorithm that estimates rho online and switches acquisition strategy accordingly; it wins all 16 HPO-B search spaces at B=100.

Key Contribution

Unstable BO leaderboard rankings are likely a consequence of ignoring the budget ratio (B/|A|) and the prior rank correlation. The paper combines both into the Portable Regime Score (PRS), which predicts performance reversals across regimes.

Abstract

Published transfer-BO comparisons often estimate an average treatment effect of acquisition choice over hidden regime variables, while practitioners need the conditional effect for their specific prior quality, budget ratio, and metric. An audit of 40 transfer-BO papers from NeurIPS, ICML, ICLR, AISTATS, UAI, TMLR, JMLR, and AutoML-Conf (2022-2025) finds that 98% never vary B/|A| as a controlled axis. On the same GDSC2 benchmark, changing only the budget reverses the ranking: at B=50, Greedy outperforms UCB by 0.050 Hit@1, while at B=100, UCB outperforms Greedy by 0.035. We capture this transition with the Portable Regime Score PRS=(B/|A|)(1-rho), where rho is the prior rank correlation and can be estimated from pilot contexts before the main comparison. Across 79 conditions spanning chemistry, drug-response biology, and HPO, a hierarchical model gives beta=0.50 (p=1.1e-9), and 19% of conditions fall in an equivalence zone where |advantage|<0.01 Hit@1. In five published reversal cases, PRS predicts the winner from pre-comparison observables. A No-Free-Leaderboard proposition explains why unconditional rankings are unstable: when CATE changes sign across regimes, the reported ATE becomes a function of benchmark mixture. RegimePlanner, which estimates rho online and switches acquisition accordingly, wins all 16 HPO-B search spaces at B=100 and exceeds the matched {Greedy,UCB} per-context oracle on GDSC2 by 18%. Pre-registered predictions achieve 27/40=67.5% overall accuracy and above 90% within EMA prior families. The practical protocol is simple: report B/|A|, rho, K, and metric alongside any claimed acquisition advantage.
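The PRS formula and the mixture argument behind the No-Free-Leaderboard proposition can be illustrated with a small sketch. This is not the authors' code: the function names, the use of Spearman correlation as the estimator of rho, the arm count of 200, and the benchmark mixture weights are all assumptions made for illustration. Only the formula PRS=(B/|A|)(1-rho) and the two GDSC2 Hit@1 gaps (+0.050 for Greedy at B=50, +0.035 for UCB at B=100) come from the abstract.

```python
from statistics import mean


def _ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman_rho(prior_scores, observed_scores):
    """Rank correlation between prior predictions and pilot observations.

    Assumption: Spearman correlation stands in for the paper's rho estimator.
    """
    rx, ry = _ranks(prior_scores), _ranks(observed_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def portable_regime_score(budget, num_arms, rho):
    """PRS = (B / |A|) * (1 - rho), as defined in the abstract."""
    if budget <= 0 or num_arms <= 0:
        raise ValueError("budget and num_arms must be positive")
    return (budget / num_arms) * (1.0 - rho)


def mixture_ate(cate_by_regime, weights):
    """ATE as a weighted average of per-regime conditional effects (CATE)."""
    total = sum(weights)
    return sum(c * w for c, w in zip(cate_by_regime, weights)) / total


if __name__ == "__main__":
    # Hypothetical pilot context: prior scores vs. observed outcomes for 5 arms.
    rho = spearman_rho([0.9, 0.7, 0.5, 0.3, 0.1], [0.8, 0.9, 0.4, 0.5, 0.2])
    # num_arms=200 is a made-up value; the abstract does not state |A|.
    print("rho:", rho, "PRS:", portable_regime_score(50, 200, rho))

    # Greedy-minus-UCB Hit@1 on GDSC2 at B=50 and B=100 (from the abstract).
    cate = [0.050, -0.035]
    # Two hypothetical benchmark mixtures over the two budget regimes.
    print("mostly low budget :", mixture_ate(cate, [0.8, 0.2]))
    print("mostly high budget:", mixture_ate(cate, [0.2, 0.8]))
```

With the same two conditional effects, the sign of the averaged effect depends only on how the benchmark weights the regimes, which is why the closing protocol asks for B/|A|, rho, K, and the metric to be reported alongside any claimed advantage.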

Eval Frameworks & Benchmarks | Scientific Discovery & Drug Design | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
