This paper compares the internal representations of diffusion language models (dLLMs) and autoregressive (AR) language models across layers and tokens. It finds that dLLMs form more hierarchical abstractions with substantial early-layer redundancy, whereas AR models produce tightly coupled, depth-dependent representations. It also shows that AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, indicating a persistent initialization bias. Leveraging the observed redundancy, the authors introduce a static, task-agnostic inference-time layer-skipping method with which native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% of performance on reasoning and code generation benchmarks.
Diffusion language models have surprisingly redundant early layers, enabling up to 18.75% FLOPs reduction at inference time via layer skipping while preserving over 90% of performance.
Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
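To make the idea of static, task-agnostic layer skipping concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the toy layers, and the choice of which indices to skip are all hypothetical. The FLOPs estimate assumes every layer costs the same and ignores the embeddings and output head. Under that assumption, skipping 6 of 32 layers is one configuration that yields 18.75%; the abstract does not state which layers or how many the method actually skips.

```python
from typing import Callable, List, Set

# A "layer" here is any function mapping a hidden state to a new hidden state.
Layer = Callable[[List[float]], List[float]]


def run_with_static_skip(layers: List[Layer], hidden: List[float],
                         skip: Set[int]) -> List[float]:
    """Apply layers in order, bypassing every index listed in `skip`.

    The skip set is fixed ahead of time (static) and does not depend on the
    input or the task; a skipped layer acts as the identity.
    """
    for index, layer in enumerate(layers):
        if index in skip:
            continue  # hidden state passes through unchanged
        hidden = layer(hidden)
    return hidden


def flops_reduction(num_layers: int, skip: Set[int]) -> float:
    """Fraction of layer compute removed, assuming equal cost per layer."""
    return len(skip) / num_layers


def make_toy_layer(delta: float) -> Layer:
    """Hypothetical stand-in for a transformer block: adds `delta` everywhere."""
    def layer(hidden: List[float]) -> List[float]:
        return [x + delta for x in hidden]
    return layer


# Illustrative configuration: 32 layers, 6 early layers skipped (an assumption).
toy_layers = [make_toy_layer(1.0) for _ in range(32)]
skipped = {2, 3, 4, 5, 6, 7}

output = run_with_static_skip(toy_layers, [0.0, 0.5], skipped)
saving = flops_reduction(len(toy_layers), skipped)
print(output, saving)
```

Because the skip set is chosen once and applied to every input, this kind of skipping needs no architectural changes and no per-task tuning, which matches the properties the abstract describes.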