This paper reviews the use of synthetic data generated by AI models for statistical inference, focusing on the assumptions required for validity, reliability, and principled application. It surveys generative models, their use cases, benefits, limitations, and failure modes, while also examining pitfalls like model misspecification and attenuated uncertainty. The paper then discusses emerging frameworks for principled synthetic data use and provides practical recommendations for researchers.
Synthetic data from generative AI can mislead statistical inference if used naively, but this paper clarifies the assumptions and pitfalls to avoid, offering a roadmap for principled application.
The emergence of generative AI models has dramatically expanded the availability and use of synthetic data across scientific, industrial, and policy domains. While these developments open new possibilities for data analysis, they also raise fundamental statistical questions about when synthetic data can be used in a valid, reliable, and principled manner. This paper reviews the current landscape of synthetic data generation and use from a statistical perspective, with the goal of clarifying the assumptions under which synthetic data can meaningfully support downstream discovery, inference, and prediction. We survey major classes of modern generative models, their intended use cases, and the benefits they offer, while also highlighting their limitations and characteristic failure modes. We additionally examine common pitfalls that arise when synthetic data are treated as surrogates for real observations, including biases from model misspecification, attenuated uncertainty, and difficulties in generalization. Building on these insights, we discuss emerging frameworks for the principled use of synthetic data. We conclude with practical recommendations, open problems, and cautions intended to guide both method developers and applied researchers.