This paper revisits the empirical evaluation of WINO, a representative training-free, post-hoc remasking method for masked diffusion language models (dLLMs). Under standard decoding settings (shorter block lengths), the authors find that it brings little-to-no benefit over confidence-based unmasking alone. Under non-greedy decoding, confidence-based remasking mitigates some of the errors introduced by increased stochasticity, but it also exacerbates the diversity collapse previously reported for confidence-based unmasking. The benefits of remasking are therefore highly setting-dependent, and the authors call for a more comprehensive evaluation framework.
Post-hoc confidence-based remasking in dLLMs brings little-to-no benefit under standard decoding settings and can exacerbate diversity collapse under non-greedy decoding.
Masked diffusion language models (dLLMs) have recently emerged as a competitive alternative to autoregressive language models, with the promise of faster inference via parallel token generation. A notable limitation of the masked formulation, however, is that once a token has been unmasked it can no longer be revised, leaving dLLMs vulnerable to early sampling mistakes. To address this, a growing body of work has sought to extend masked dLLMs with self-correcting (remasking) capabilities. One appealing subset of these methods does so in a training-free, post-hoc manner based on token confidences, with encouraging early reported results. In this work, we revisit the empirical evaluation of a representative post-hoc remasking method, WINO [Hong et al., 2026], and find that under standard decoding settings (shorter block lengths) it brings little-to-no benefit over confidence-based unmasking alone [Wu et al., 2025]. Extending the evaluation to non-greedy decoding, we find that while confidence-based remasking can mitigate errors introduced by increased stochasticity to some extent, it also exacerbates the diversity collapse previously reported for confidence-based unmasking. Overall, our results show that the benefits of post-hoc confidence-based remasking are highly setting-dependent, underscoring the need for a more comprehensive evaluation framework.
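To make the distinction between the two mechanisms concrete, the following is a minimal toy sketch of confidence-based unmasking followed by a post-hoc remasking pass. It is not the WINO procedure or the method of Wu et al.; the `MASK` sentinel, the function names, the threshold values, and the hand-written confidence scores are all assumptions chosen for illustration, and a real decoder would obtain predictions and confidences from the model at every step.

```python
MASK = -1  # hypothetical sentinel id for a still-masked position


def unmask_step(tokens, preds, confs, threshold=0.9):
    """Commit predictions at masked positions whose confidence clears the threshold.

    If no masked position clears it, commit the single most confident one so
    that decoding always makes progress. Already committed tokens are untouched.
    """
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    if not masked:
        return list(tokens)
    chosen = [i for i in masked if confs[i] >= threshold]
    if not chosen:
        chosen = [max(masked, key=lambda i: confs[i])]
    out = list(tokens)
    for i in chosen:
        out[i] = preds[i]
    return out


def remask_step(tokens, confs, remask_threshold=0.5):
    """Re-mask committed tokens whose re-evaluated confidence fell below a threshold.

    This is the step that plain unmasking lacks: it lets an earlier commitment
    be withdrawn once more context is available.
    """
    return [
        MASK if t != MASK and confs[i] < remask_threshold else t
        for i, t in enumerate(tokens)
    ]


# Toy run with made-up predictions and confidences (assumed values).
tokens = [MASK, MASK, MASK, MASK]
preds = [11, 12, 13, 14]
confs = [0.95, 0.40, 0.92, 0.10]

after_unmask = unmask_step(tokens, preds, confs)
print(after_unmask)  # [11, -1, 13, -1]

# Suppose re-scoring with the new context lowers confidence in position 2.
rescored = [0.97, 0.60, 0.30, 0.20]
after_remask = remask_step(after_unmask, rescored)
print(after_remask)  # [11, -1, -1, -1]
```

With unmasking alone, the token committed at position 2 would be final; the remasking pass returns it to the masked state so that a later step can resample it. Both steps select tokens purely by confidence, which is the shared mechanism the abstract ties to diversity collapse.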