The paper introduces TabDLM, a novel diffusion-based framework for generating heterogeneous tabular data containing both numerical and free-form text fields. TabDLM leverages masked diffusion language models (MDLMs) to model text and categorical features while employing a continuous diffusion process with learned numeric token embeddings for numerical features. Experiments on diverse benchmarks demonstrate that TabDLM outperforms existing diffusion- and LLM-based methods in generating high-quality heterogeneous tabular data.
Generating realistic tabular data with both numbers and free-form text just got easier: TabDLM bridges the gap between diffusion models and LLMs for superior joint modeling.
Synthetic tabular data generation has attracted growing attention due to its importance for data augmentation, foundation models, and privacy. However, real-world tabular datasets increasingly contain free-form text fields (e.g., reviews or clinical notes) alongside structured numerical and categorical attributes. Generating such heterogeneous tables with joint modeling of different modalities remains challenging. Existing approaches broadly fall into two categories: diffusion-based methods and LLM-based methods. Diffusion models can capture complex dependencies over numerical and categorical features in continuous or discrete spaces, but extending them to open-ended text is nontrivial and often leads to degraded text quality. In contrast, LLM-based generators naturally produce fluent text, yet their discrete tokenization can distort precise or wide-range numerical values, hindering accurate modeling of both numbers and language. In this work, we propose TabDLM, a unified framework for free-form tabular data generation via a joint numerical-language diffusion model built on masked diffusion language models (MDLMs). TabDLM models textual and categorical features through masked diffusion, while modeling numerical features with a continuous diffusion process using learned, specialized numeric token embeddings; bidirectional attention then captures cross-modality interactions within a single model. Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.
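To make the two noising processes in the abstract concrete, the following is a minimal sketch of how one table row might be corrupted at a diffusion time t: text and categorical tokens are masked independently, while numerical values receive Gaussian noise. It is an illustration under stated assumptions, not TabDLM's implementation. The `[MASK]` token name, the linear schedule (signal weight 1 - t), the shared timestep across modalities, and all function names are hypothetical, and the learned numeric token embeddings and the denoising network are omitted.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token name, not taken from the paper


def mask_tokens(tokens, t, rng):
    """Masked-diffusion forward step: replace each token with MASK
    independently with probability t (t=0 keeps all, t=1 masks all)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def noise_numeric(x0, t, rng):
    """Variance-preserving Gaussian forward step under an assumed
    linear schedule: x_t = sqrt(1 - t) * x0 + sqrt(t) * eps."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(1.0 - t) * x0 + math.sqrt(t) * eps


def corrupt_row(text_tokens, numeric_values, t, seed=0):
    """Corrupt both modalities of one row at the same time t
    (sharing t across modalities is an assumption of this sketch)."""
    rng = random.Random(seed)
    noisy_tokens = mask_tokens(text_tokens, t, rng)
    noisy_numbers = [noise_numeric(x, t, rng) for x in numeric_values]
    return noisy_tokens, noisy_numbers


if __name__ == "__main__":
    tokens = ["battery", "lasts", "all", "day"]
    numbers = [4.5, 129.99]  # assumed already standardized in practice
    for t in (0.0, 0.5, 1.0):
        print(t, corrupt_row(tokens, numbers, t))
```

At t = 0 the row is returned unchanged, and at t = 1 every token is masked and each number is pure noise. A model in the spirit of the abstract would be trained to reverse both corruptions jointly, with bidirectional attention letting masked text positions condition on the noisy numbers and vice versa.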