Georgia Tech, UMass | Mar 2, 2026 | arXiv:2603.01331

MetaState: Persistent Working Memory for Discrete Diffusion Language Models

Kejing Xia, Mingzhe Li, Lixuan Wei, Zhenbang Du, Xiangchi Yuan, Wenke Lee

AI Summary

The paper introduces MetaState, a recurrent augmentation for discrete diffusion language models (dLLMs) that addresses the "Information Island" problem by maintaining a persistent, fixed-size working memory across denoising steps. MetaState comprises a cross-attention Mixer, a GRU-style Updater, and a cross-attention Injector, enabling the model to retain and integrate information from previous denoising steps. Fine-tuning MetaState modules with K-step unrolling on LLaDA-8B and Dream-7B demonstrates improved accuracy over frozen dLLM baselines, highlighting the benefits of persistent cross-step memory.

Key Contribution

Discrete diffusion language models can now achieve higher accuracy without retraining the entire backbone, thanks to a lightweight recurrent memory module that bridges denoising steps.

Abstract

Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the Information Island problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with MetaState, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. MetaState consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with K-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, MetaState introduces negligible trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.
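The read, update, write cycle described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the cross-attention here has no learned projections, the Updater is reduced to a single update gate, the backbone is replaced by random activations, and every name (`cross_attend`, `gated_update`, `metastate_step`, `run_unrolled`, `w_gate`) and every size is hypothetical. It only shows how a fixed number of memory slots can persist across K denoising steps regardless of sequence length.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(a):
    """Numerically stable softmax over each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention without learned projections (assumption)."""
    dim = len(queries[0])
    kv_t = [list(col) for col in zip(*keys_values)]
    scores = [[v / math.sqrt(dim) for v in row] for row in matmul(queries, kv_t)]
    return matmul(softmax_rows(scores), keys_values)


def gated_update(memory, candidate, w_gate):
    """Simplified GRU-style update: one gate blends old memory with new input."""
    concat = [m + c for m, c in zip(memory, candidate)]  # feature-wise concat, (S, 2d)
    gate = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matmul(concat, w_gate)]
    return [
        [z * m + (1.0 - z) * math.tanh(c) for z, m, c in zip(zr, mr, cr)]
        for zr, mr, cr in zip(gate, memory, candidate)
    ]


def metastate_step(hidden, memory, w_gate):
    """One denoising step: Mixer reads, Updater integrates, Injector writes back."""
    read = cross_attend(memory, hidden)           # Mixer: slots attend to tokens, (S, d)
    memory = gated_update(memory, read, w_gate)   # Updater: carry state across steps
    written = cross_attend(hidden, memory)        # Injector: tokens attend to slots, (T, d)
    hidden = [[h + w for h, w in zip(hr, wr)] for hr, wr in zip(hidden, written)]
    return hidden, memory


def run_unrolled(seq_len, num_slots=4, dim=8, k_steps=3, seed=0):
    """Unroll k_steps steps; random activations stand in for the frozen backbone."""
    rng = random.Random(seed)
    w_gate = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(2 * dim)]
    memory = [[0.0] * dim for _ in range(num_slots)]
    hidden = []
    for _ in range(k_steps):
        hidden = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
        hidden, memory = metastate_step(hidden, memory, w_gate)
    return hidden, memory
```

Calling `run_unrolled(seq_len=5)` and `run_unrolled(seq_len=11)` returns memories of the same slot-by-dimension shape, which is the sense in which the working memory is independent of sequence length. In a trainable version, the projections and gate weights would be the only learned parameters while the backbone stays frozen, as the abstract states.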

Architecture Design (Transformers, SSMs, MoE) | Inference & Quantization | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
