The paper investigates how probabilistic predictions from tabular foundation models such as TabPFN and TabICL are evaluated in regression settings, and highlights a limitation of current benchmarks: they assess only point estimates. It advocates the use of proper scoring rules, in particular the continuous ranked probability score (CRPS), to assess the quality of distributional forecasts. The authors illustrate that the choice of scoring rule changes the inductive bias of the trained model, and they therefore recommend fine-tuning tabular foundation models, or making them promptable, to improve probabilistic regression performance.
Tabular foundation models excel on point-estimate benchmarks, but reliably evaluating their probabilistic regression capabilities requires proper scoring rules such as CRPS, which exposes a blind spot in current evaluation practice.
Prior-Data Fitted Networks (PFNs), such as TabPFN and TabICL, have revolutionized tabular deep learning by leveraging in-context learning for tabular data. These models are intended as foundation models for classification and regression and promise to greatly simplify deployment in practice, because their performance is unprecedented (in terms of mean squared error or $R^2$, when measured on common benchmarks like TabArena or TALENT). However, we see an important weakness in current benchmarks for the regression setting: they focus on win rates and on performance measured by metrics like (root) mean squared error or $R^2$. These leaderboards therefore push researchers, implicitly and explicitly, to optimize for machine learning pipelines that yield a good estimate of the mean. The main problem is that this approach evaluates only a point estimate (namely the conditional mean, which is the Bayes estimator under mean squared error loss). In this article we discuss the application of proper scoring rules for evaluating the quality of probabilistic forecasts in distributional regression. We also propose to extend common machine learning benchmarks with metrics for probabilistic regression. To improve the status quo and make the machine learning community aware of scoring rules for probabilistic regression, we advocate using the continuous ranked probability score (CRPS) in such benchmarks. However, we also illustrate that the choice of scoring rule changes the inductive bias of the trained model. We therefore advocate fine-tuning tabular foundation models or making them promptable.
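To make the contrast between point-estimate metrics and a proper scoring rule concrete, here is a minimal sketch of how the CRPS could be computed. It is not the authors' evaluation code. It assumes the predictive distribution is available either as a set of samples or as a Gaussian, and it uses the standard identity $\mathrm{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ with $X, X' \sim F$ independent. The function names `crps_from_samples` and `crps_gaussian` are hypothetical and chosen for illustration only.

```python
import math

import numpy as np


def crps_from_samples(samples, y):
    """Estimate CRPS(F, y) = E|X - y| - 0.5 * E|X - X'| from samples of F.

    Illustrative helper (hypothetical name). Lower values are better.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    # First term: mean absolute error of the samples against the observation.
    term_obs = np.mean(np.abs(x - y))
    # Second term: mean absolute difference over all n^2 sample pairs,
    # computed from the sorted samples in O(n log n) instead of O(n^2).
    ranks = np.arange(1, n + 1)
    mean_abs_diff = 2.0 * np.sum((2 * ranks - n - 1) * x) / (n * n)
    return float(term_obs - 0.5 * mean_abs_diff)


def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


# Two forecasts with the same mean but different spread, one observation.
y_obs = 0.2
sharp = crps_gaussian(mu=0.0, sigma=1.0, y=y_obs)
wide = crps_gaussian(mu=0.0, sigma=3.0, y=y_obs)

# The squared error of the predicted mean is identical for both forecasts.
squared_error = (0.0 - y_obs) ** 2

print(f"squared error of the mean: {squared_error:.4f} (same for both)")
print(f"CRPS, sigma=1: {sharp:.4f}")
print(f"CRPS, sigma=3: {wide:.4f}")
```

Both forecasts share the same mean, so mean squared error cannot tell them apart, whereas the CRPS penalizes the needlessly wide one. For a forecast that collapses to a single point, the CRPS reduces to the absolute error, which is why it can be read as a distributional generalization of a point-estimate metric.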