Search papers, labs, and topics across Lattice.
The paper investigates the quality-privacy tradeoff in tabular data synthesis, showing that existing models struggle in small-data regimes due to memorization. To mitigate this, they introduce DiffICL, a method that leverages in-context learning with pretrained structural priors to generate tabular data. Experiments on 14 real-world datasets demonstrate that DiffICL simultaneously improves data quality and privacy, outperforming dataset-specific generative models.
Tabular data synthesis no longer needs to sacrifice privacy for quality: pretraining on diverse datasets lets models generalize from limited context, breaking the traditional tradeoff.
Tabular data synthesis aims to generate high-quality data while preserving privacy. However, we find that existing tabular generative models exhibit a clear tradeoff in the small-data regime: improving data quality typically comes at the cost of increased memorization of training samples, thereby weakening privacy protection. This tradeoff arises because small training sets make it difficult for dataset-specific generative models to distinguish generalizable structure from sample-specific patterns. To address this, we propose DiffICL, which formulates tabular data generation as an in-context learning problem. Instead of fitting each dataset from scratch,DiffICL leverages pretrained structural priors learned from a large collection of datasets, enabling it to infer data distributions from limited context rather than memorizing individual samples. We evaluate DiffICL on 14 real-world datasets. Results show that DiffICL improves both data quality and privacy, and generate synthetic data that provides effective data augmentation. Our findings suggest that the quality-privacy tradeoff can be improved through better training paradigms.