Tsinghua AI | SEU | Siemens AI | May 6, 2026 | arXiv:2605.04911

Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning

Xinyan Han, Yan Lu, Xiaoyu Lin, Yuanyuan Jiang, Yuanrui Wang, Xuanyue Li, Wenchao Zou, Xingxuan Zhang

AI Summary

The paper investigates the quality-privacy tradeoff in tabular data synthesis, showing that existing models struggle in small-data regimes because they memorize training samples. To mitigate this, the authors introduce DiffICL, a method that uses in-context learning with pretrained structural priors to generate tabular data. Experiments on 14 real-world datasets show that DiffICL improves data quality and privacy simultaneously, outperforming dataset-specific generative models.

Key Contribution

Tabular data synthesis no longer needs to sacrifice privacy for quality: pretraining on diverse datasets lets models generalize from limited context, breaking the traditional tradeoff.

Abstract

Tabular data synthesis aims to generate high-quality data while preserving privacy. However, we find that existing tabular generative models exhibit a clear tradeoff in the small-data regime: improving data quality typically comes at the cost of increased memorization of training samples, thereby weakening privacy protection. This tradeoff arises because small training sets make it difficult for dataset-specific generative models to distinguish generalizable structure from sample-specific patterns. To address this, we propose DiffICL, which formulates tabular data generation as an in-context learning problem. Instead of fitting each dataset from scratch, DiffICL leverages pretrained structural priors learned from a large collection of datasets, enabling it to infer data distributions from limited context rather than memorizing individual samples. We evaluate DiffICL on 14 real-world datasets. Results show that DiffICL improves both data quality and privacy, and generates synthetic data that provides effective data augmentation. Our findings suggest that the quality-privacy tradeoff can be improved through better training paradigms.
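
The abstract does not say how memorization is measured. As an illustration only, the sketch below uses one common proxy, distance to closest record (DCR): each synthetic row is compared against the training set and against a held-out set drawn from the same distribution. If synthetic rows sit closer to training rows than to held-out rows far more than half the time, the generator is likely copying samples rather than learning structure. The function names (`distance_to_closest_record`, `memorization_share`), the Euclidean distance, and the assumption of purely numeric, pre-scaled features are all choices made for this example and are not taken from the paper.

```python
import numpy as np


def distance_to_closest_record(synthetic, reference):
    """Return, for each synthetic row, the Euclidean distance to its nearest reference row.

    Assumes both inputs are 2-D numeric arrays with the same number of columns.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Pairwise differences via broadcasting: (n_synthetic, n_reference, n_features)
    diffs = synthetic[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)


def memorization_share(synthetic, train, holdout):
    """Fraction of synthetic rows strictly closer to the training set than to the holdout set.

    Hypothetical helper for illustration: a value near 0.5 is what one would
    expect without memorization (when train and holdout have similar size),
    while a value near 1.0 suggests the generator reproduces training rows.
    """
    d_train = distance_to_closest_record(synthetic, train)
    d_holdout = distance_to_closest_record(synthetic, holdout)
    return float(np.mean(d_train < d_holdout))
```

A generator that simply returns copies of the training rows would score 1.0 under this proxy, provided no held-out row duplicates a training row. The pairwise computation is quadratic in memory, which is acceptable in the small-data regime the abstract discusses but would need a nearest-neighbor index for larger tables.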

Data Curation & Synthetic Data | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
