This paper investigates whether dataset meta-features can explain performance differences between tabular foundation models and traditional models on the TabArena benchmark. The authors relate dataset-level performance gaps to model-agnostic dataset descriptors, applying strict statistical tests with false discovery control. The key finding is that global meta-feature approaches are not robust enough to explain performance differences across the 51 datasets in TabArena, with only limited success in specific model comparisons.
Turns out, dataset meta-features can't reliably explain why one tabular model beats another, suggesting tabular data is more heterogeneous than we thought.
With tabular foundation models on the rise and traditional models still performing well on many tasks, choosing the right model for a tabular dataset remains difficult. We investigate whether dataset meta-features can explain performance gaps between model families on tabular prediction tasks. Using the TabArena benchmark results, we analyze dataset-level performance gaps and relate them to model-agnostic dataset descriptors. After strict statistical tests with false discovery control, we find that (1) for neural network vs. tree gaps, no meta-feature survives false discovery control, (2) for non-foundation vs. foundation model gaps, one association is robust but does not generalize when tested in leave-one-dataset-out prediction, and (3) for TabICLv2 vs. TabPFN-2.6, one robust association also improves held-out prediction. Furthermore, we conduct a leave-one-dataset-out analysis and find that meta-feature predictors fail to improve meaningfully over a simple baseline. Overall, our results highlight the heterogeneity of tabular datasets and show that global meta-feature approaches are not robust enough to explain performance gaps on the 51 TabArena datasets.