Apr 5, 2026 · arXiv:2604.04195

Stable and Privacy-Preserving Synthetic Educational Data with Empirical Marginals: A Copula-Based Approach

Gabriel Diaz Ramos, Lorenzo Luzi, Debshila Basu Mallick, Richard Baraniuk

AI Summary

The paper introduces Non-Parametric Gaussian Copula (NPGC), a synthetic data generation method for educational data that preserves empirical marginal distributions and models dependencies using a copula framework. NPGC incorporates differential privacy at both the marginal and correlation levels, handling heterogeneous variable types and missing data explicitly. Experiments on benchmark datasets and a real-world platform show NPGC's stability across regeneration cycles, competitive downstream performance, and lower computational cost compared to deep learning and parametric baselines.

Key Contribution

Generates stable, privacy-preserving synthetic educational data without deep learning, using a non-parametric Gaussian copula anchored to the empirical marginals.

Abstract

To advance Educational Data Mining (EDM) within strict privacy-protecting regulatory frameworks, researchers must develop methods that enable data-driven analysis while protecting sensitive student information. Synthetic data generation is one such approach, enabling the release of statistically generated samples instead of real student records; however, existing deep learning and parametric generators often distort marginal distributions and degrade under iterative regeneration, leading to distribution drift and progressive loss of distributional support that compromise reliability. In response, we introduce the Non-Parametric Gaussian Copula (NPGC), a plug-and-play synthesis method that replaces deep learning and parametric optimization with empirical statistical anchoring to preserve the observed marginal distributions while modeling dependencies through a copula framework. NPGC integrates Differential Privacy (DP) at both the marginal and correlation levels, supports heterogeneous variable types, and treats missing data as an explicit state to retain informative absence patterns. We evaluate NPGC against deep learning and parametric baselines on five benchmark datasets and demonstrate that it remains stable across multiple regeneration cycles and achieves competitive downstream performance at substantially lower computational cost. We further validate NPGC through deployment in a real-world online learning platform, demonstrating its practicality for privacy-preserving research.
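The core idea in the abstract, keeping the observed marginals and modeling dependence through a Gaussian copula, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`normal_scores`, `correlation_matrix`, `cholesky`, `sample_synthetic`) are hypothetical, the sketch assumes purely numeric columns with no missing values, and it omits the differential-privacy noise, heterogeneous variable handling, and missing-data state that the abstract attributes to NPGC.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def normal_scores(column):
    """Map a numeric column to standard-normal scores via its empirical ranks.

    Ties are broken by input order, which is a simplification."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1)
        scores[i] = _STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    d = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(d)
        ]
        for i in range(d)
    ]


def cholesky(matrix):
    """Lower-triangular Cholesky factor.

    Assumes the matrix is positive definite (no perfectly correlated columns)."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_synthetic(columns, n_samples, seed=0):
    """Draw synthetic rows whose values come only from the observed marginals."""
    rng = random.Random(seed)
    corr = correlation_matrix([normal_scores(c) for c in columns])
    lower = cholesky(corr)
    sorted_cols = [sorted(c) for c in columns]
    d = len(columns)
    rows = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Correlate the independent draws with the Cholesky factor.
        latent = [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        row = []
        for j, x in enumerate(latent):
            u = _STD_NORMAL.cdf(x)  # back to a uniform in (0, 1)
            # Empirical quantile: index into the sorted observed values.
            idx = min(int(u * len(sorted_cols[j])), len(sorted_cols[j]) - 1)
            row.append(sorted_cols[j][idx])
        rows.append(row)
    return rows
```

Because every synthetic value is read back from the sorted observed column, this sketch cannot produce values outside the empirical support, which is one way to picture what "anchoring" to empirical marginals might mean. A privacy-preserving version would additionally need to perturb both the marginals and the correlation matrix, which is not shown here.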

Constitutional AI & AI Ethics · Data Curation & Synthetic Data

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
