This paper introduces D^2-Monitor, a bi-level safety monitoring system for diffusion LLMs that leverages the multi-step denoising process to detect safety-relevant information in intermediate hidden representations. The key insight is that "safety hesitation," measured by the frequency of intermediate states near a probe's decision boundary, predicts probe failure and sample difficulty. D^2-Monitor uses a lightweight probe for initial classification and hesitation estimation, dynamically activating a more expressive probe when hesitation exceeds a threshold, achieving state-of-the-art safety performance with minimal computational overhead.
Diffusion LLMs betray their safety violations through "hesitation" in their intermediate generation steps, offering a new signal for lightweight, dynamic safety monitoring.
Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory effectively predicts probe failure, providing a proxy for sample difficulty. Building on this analysis, we propose D^2-Monitor, a bi-level safety monitor for D-LLMs. D^2-Monitor adopts a lightweight probe as an always-on monitor to jointly estimate hesitation and perform base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, D^2-Monitor achieves state-of-the-art performance with a compact parameter footprint (≤ 0.85M parameters) and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
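The routing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a linear lightweight probe whose decision boundary sits at score 0, counts hesitation steps as those with |score| within a margin, and uses the mean score over the trajectory as the base classification. All function names, the `margin` and `max_hesitation` values, the mean-score aggregation, and the `heavy_probe` callable are hypothetical choices made for illustration.

```python
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def linear_probe_score(hidden: Vector, weights: Vector, bias: float) -> float:
    """Signed score of a linear probe; the decision boundary is at 0 (assumption)."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def count_hesitation_steps(scores: Sequence[float], margin: float) -> int:
    """Number of denoising steps whose score lies within `margin` of the boundary."""
    return sum(1 for s in scores if abs(s) <= margin)


def bilevel_monitor(
    trajectory: Sequence[Vector],
    weights: Vector,
    bias: float,
    heavy_probe: Callable[[Sequence[Vector]], bool],
    margin: float = 0.5,        # illustrative value, not from the paper
    max_hesitation: int = 1,    # illustrative value, not from the paper
) -> Tuple[bool, int, bool]:
    """Return (flagged_unsafe, hesitation_steps, escalated)."""
    if not trajectory:
        raise ValueError("trajectory must contain at least one hidden state")
    # The lightweight probe runs on every intermediate hidden state.
    scores = [linear_probe_score(h, weights, bias) for h in trajectory]
    hesitation = count_hesitation_steps(scores, margin)
    if hesitation > max_hesitation:
        # Too many near-boundary steps: defer to the heavier probe.
        return heavy_probe(trajectory), hesitation, True
    # Base classification: mean score above the boundary (assumed aggregation).
    return sum(scores) / len(scores) > 0.0, hesitation, False
```

With weights `[1.0, -1.0]` and bias `0.0`, a trajectory whose scores stay far from 0 is classified by the lightweight probe alone, while a trajectory with several scores inside the margin is passed to `heavy_probe`. In practice the margin and threshold would need to be tuned on held-out data.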