Utah | West Virginia University | Jun 8, 2026 | arXiv:2606.09257

BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Gyawali, Gianfranco Doretto, Donald A. Adjeroh

AI Summary

This paper introduces BSTabDiff, a generative framework specifically designed for High-Dimensional Low-Sample Size (HDLSS) tabular data by partitioning observed features into latent blocks to enhance dependence learning. By leveraging shared low-dimensional subunit variables and employing modern deep priors like diffusion and normalizing flows, BSTabDiff effectively addresses the challenges of local correlations, sparse dependencies, and structured missingness in HDLSS domains. Empirical evaluations demonstrate that BSTabDiff outperforms traditional unstructured tabular generators, yielding more realistic and stable synthetic data in these challenging settings.

Key Contribution

BSTabDiff uses a block-subunit structure to capture dependencies in HDLSS tabular data, producing more realistic and stable synthetic data than unstructured tabular generators.

Abstract

High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ is the number of samples and $m$ is the number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable, concentrating global dependence learning in the compact block-latent space $\mathbb{R}^M$ while decoding to the full feature space with copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data when compared with unstructured tabular generators on HDLSS data.
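The block-subunit idea in the abstract can be illustrated with a toy sampler. This is only a sketch under stated assumptions, not the authors' implementation: the function name `sample_block_subunit`, the one-factor Gaussian copula with a single correlation `rho`, the exponential marginal, and the block-wise dropout used for structured missingness are all hypothetical choices, and an independent standard-normal draw stands in for the learned diffusion or flow prior on the block latents.

```python
import math
import random
from statistics import NormalDist


def sample_block_subunit(n, m, M, rho=0.7, block_miss_rate=0.1, seed=0):
    """Toy block-subunit sampler (illustrative assumptions throughout).

    n: samples, m: observed features, M: latent blocks (M << m).
    Returns (rows, block_of); missing entries are encoded as None.
    """
    rng = random.Random(seed)
    phi = NormalDist().cdf  # standard normal CDF

    # Contiguous, near-equal partition of the m features into M blocks.
    block_of = [j * M // m for j in range(m)]

    rows = []
    for _ in range(n):
        # Block latents in R^M. A learned diffusion or flow prior would be
        # sampled here; a standard normal is only a placeholder.
        z = [rng.gauss(0.0, 1.0) for _ in range(M)]

        # Structured missingness (assumed form): whole blocks drop out.
        dropped = [rng.random() < block_miss_rate for _ in range(M)]

        row = []
        for j in range(m):
            b = block_of[j]
            if dropped[b]:
                row.append(None)
                continue
            # One-factor Gaussian copula: features in a block share z[b].
            g = math.sqrt(rho) * z[b] + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            # Map to a uniform, clamped away from 0 and 1 for a safe log.
            u = min(max(phi(g), 1e-12), 1.0 - 1e-12)
            # Per-feature marginal via inverse CDF (unit exponential here).
            row.append(-math.log(1.0 - u))
        rows.append(row)
    return rows, block_of


if __name__ == "__main__":
    rows, block_of = sample_block_subunit(n=20, m=200, M=8)
    print(len(rows), len(rows[0]), len(set(block_of)))
```

In this sketch, all cross-feature dependence flows through the `M` block latents, so replacing the placeholder draw of `z` with a richer prior changes global structure without touching the per-feature decoding step.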

Data Curation & Synthetic Data | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
