Mar 2, 2026 · arXiv:2603.02333

Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects

Xiaoyu Luo, Wenrui Yu, Qiongxiu Li, Johannes Bjerva

AI Summary

This paper investigates memorization in Diffusion Language Models (DLMs) by developing a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation. The authors theoretically prove a monotonic relationship between sampling resolution and memorization, showing that higher resolution leads to increased extraction probability, with autoregressive decoding as a limiting case. Empirical results validate the theory and demonstrate that DLMs exhibit lower memorization-based leakage of PII compared to Autoregressive Language Models (ARMs) under aligned prefix-conditioned evaluations.

Key Contribution

Under aligned prefix-conditioned evaluations, Diffusion Language Models leak less PII than Autoregressive Language Models, suggesting a potential privacy advantage for diffusion-based generation.

Abstract

Autoregressive language models (ARMs) have been shown to memorize and occasionally reproduce training data verbatim, raising concerns about privacy and copyright liability. Diffusion language models (DLMs) have recently emerged as a competitive alternative, yet their memorization behavior remains largely unexplored because their generation dynamics differ fundamentally from those of ARMs. To address this gap, we present a systematic theoretical and empirical characterization of memorization in DLMs. We propose a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization: increasing resolution strictly increases the probability of exact training data extraction, which implies that autoregressive decoding is the limiting case of diffusion-based generation at maximal sampling resolution. Extensive experiments across model scales and sampling strategies validate our theoretical predictions. Under aligned prefix-conditioned evaluations, we further demonstrate that DLMs exhibit substantially lower memorization-based leakage of personally identifiable information (PII) than ARMs.
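The resolution claim can be illustrated with a small exact calculation. The sketch below is not the paper's framework or proof. It assumes that "sampling resolution" corresponds to the number of unmasking steps, that tokens unmasked in the same step are sampled independently given the tokens already revealed, and that the model is a hypothetical three-token joint distribution (`JOINT`) with one heavily weighted "memorized" sequence (`TARGET`). All names here are illustrative. For this particular toy distribution the extraction probability grows as the schedule becomes finer; the general statement depends on the conditions of the paper's theorem.

```python
from itertools import product

# Hypothetical toy "model": a joint distribution over 3 binary tokens.
# TARGET plays the role of a memorized training sequence.
TARGET = (1, 1, 1)
JOINT = {seq: 0.05 for seq in product((0, 1), repeat=3)}
JOINT[TARGET] = 0.65  # 7 * 0.05 + 0.65 = 1.0


def conditional(pos, token, revealed):
    """P(x[pos] = token | already revealed tokens), computed from JOINT."""
    consistent = [
        (seq, p) for seq, p in JOINT.items()
        if all(seq[i] == t for i, t in revealed.items())
    ]
    total = sum(p for _, p in consistent)
    hit = sum(p for seq, p in consistent if seq[pos] == token)
    return hit / total


def extraction_probability(schedule):
    """Probability of emitting TARGET exactly under an unmasking schedule.

    `schedule` is a list of steps; each step lists the positions unmasked
    together. Positions in one step are sampled independently, conditioned
    only on tokens revealed in earlier steps (an assumption of this sketch).
    """
    revealed = {}
    prob = 1.0
    for step in schedule:
        for pos in step:
            prob *= conditional(pos, TARGET[pos], revealed)
        # Reveal the whole step only after all of its tokens are sampled.
        for pos in step:
            revealed[pos] = TARGET[pos]
    return prob


coarse = extraction_probability([[0, 1, 2]])    # 1 step: fully parallel
medium = extraction_probability([[0], [1, 2]])  # 2 steps
fine = extraction_probability([[0], [1], [2]])  # 3 steps: one token per step

print(coarse, medium, fine)
```

With one token per step the product of conditionals collapses to the joint probability of `TARGET` by the chain rule, which is the sense in which token-by-token decoding appears as the finest schedule in this toy setting.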

Data Curation & Synthetic Data · Natural Language Processing

Year: 2026

Venue: N/A
