BUPT | Mar 4, 2026 | arXiv:2603.03805

Relational In-Context Learning via Synthetic Pre-training with Structural Prior

Yanbo Wang, Jiaxuan You, Chuan Shi, Muhan Zhang

AI Summary

RDB-PFN is a relational foundation model inspired by Prior-Data Fitted Networks (PFNs). It is pre-trained entirely on synthetic relational databases produced by a Relational Prior Generator. This synthetic pre-training enables in-context learning on real-world relational prediction tasks and sidesteps the scarcity of high-quality, public RDBs. RDB-PFN achieves strong few-shot performance on 19 real-world tasks, outperforming graph-based and single-table foundation-model baselines while using a lightweight architecture.

Key Contribution

Forget scraping private databases: RDB-PFN shows that a relational foundation model can be pre-trained from scratch on over 2 million synthetic single-table and relational tasks and still achieve strong few-shot performance on real-world data.

Abstract

Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high-quality RDBs are private, scarce, and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce RDB-PFN, the first relational foundation model trained purely via synthetic data. Inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a Relational Prior Generator to create an infinite stream of diverse RDBs from scratch. Pre-trained on over 2 million synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine in-context learning. Experiments verify that RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming graph-based and single-table foundation-model baselines (given the same DFS-linearized inputs), while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN
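To make the idea of sampling a relational database from a prior and then flattening it more concrete, here is a minimal toy sketch. It is not the authors' Relational Prior Generator: the two-table schema, the function names `generate_toy_rdb` and `linearize`, the Gaussian latent variable, and the choice of count/mean aggregates are all assumptions made for illustration. It also assumes that "DFS-linearized" refers to Deep Feature Synthesis-style aggregation of child rows into parent features, which the abstract does not spell out.

```python
import random
from statistics import mean


def generate_toy_rdb(num_parents=5, max_children=4, seed=0):
    """Sample a toy two-table database: `users` (parent) and `orders` (child).

    Hypothetical illustration only: a single hidden variable per user drives
    both the parent column and the child rows, mimicking a tiny causal prior.
    """
    rng = random.Random(seed)
    users, orders = [], []
    for user_id in range(num_parents):
        latent = rng.gauss(0.0, 1.0)  # hidden cause shared across both tables
        users.append({"user_id": user_id,
                      "age": 30 + 10 * latent + rng.gauss(0, 1)})
        # Each user gets between 1 and max_children linked order rows.
        for _ in range(rng.randint(1, max_children)):
            orders.append({"user_id": user_id,  # foreign key to users
                           "amount": 50 + 20 * latent + rng.gauss(0, 5)})
    return users, orders


def linearize(users, orders):
    """Flatten the child table into per-parent aggregate features."""
    rows = []
    for user in users:
        amounts = [o["amount"] for o in orders
                   if o["user_id"] == user["user_id"]]
        rows.append({
            "user_id": user["user_id"],
            "age": user["age"],
            "order_count": len(amounts),
            "order_amount_mean": mean(amounts),
        })
    return rows


users, orders = generate_toy_rdb()
table = linearize(users, orders)
```

Each call with a different `seed` yields a different small database, and `table` is the single flat table a tabular model could consume. A real prior would also vary the schema itself (number of tables, key structure, column types), which this sketch holds fixed.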

Architecture Design (Transformers, SSMs, MoE) | Data Curation & Synthetic Data | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
