This paper investigates decoding strategies for diffusion language models (DLMs) in automatic speech recognition (ASR), comparing a fixed number of decoding rounds against static and dynamic confidence thresholding. Using negative log-likelihood-based uncertainty as a proxy for decoding progress, the authors measure round-wise accuracy and show that confidence-based thresholding significantly improves both accuracy and speed over fixed-number approaches. The key finding is that a static confidence threshold can match the accuracy of autoregressive ASR while being more efficient, because most tokens in ASR tasks converge early.
Ditch fixed-number decoding for diffusion-based ASR: a static confidence threshold matches autoregressive accuracy while keeping diffusion-style parallel decoding.
While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.
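The difference between the fixed-number and static-threshold schemes can be illustrated with a toy simulation. This is a minimal sketch, not the authors' implementation: `toy_confidences` is a hypothetical stand-in for a DLM forward pass, the per-position base confidences and the 0.05 context boost are invented for illustration, and the fallback that commits the single most confident token when nothing clears the threshold is an assumption. The dynamic-threshold variant is left out because the abstract does not say how the threshold changes across rounds.

```python
import math


def toy_confidences(base, revealed):
    """Hypothetical stand-in for a DLM forward pass.

    Assumption: every position's confidence rises by 0.05 for each
    token already committed, capped at 1.0.
    """
    boost = 0.05 * sum(revealed)
    return [min(1.0, b + boost) for b in base]


def decode_fixed_number(base, num_rounds):
    """Commit an equal share of tokens per round, most confident first."""
    n = len(base)
    revealed = [False] * n
    per_round = math.ceil(n / num_rounds)
    rounds = 0
    while not all(revealed):
        conf = toy_confidences(base, revealed)
        masked = [i for i in range(n) if not revealed[i]]
        masked.sort(key=lambda i: conf[i], reverse=True)
        for i in masked[:per_round]:
            revealed[i] = True
        rounds += 1
    return rounds


def decode_static_threshold(base, tau):
    """Commit every masked token whose confidence is at least tau."""
    n = len(base)
    revealed = [False] * n
    rounds = 0
    while not all(revealed):
        conf = toy_confidences(base, revealed)
        masked = [i for i in range(n) if not revealed[i]]
        chosen = [i for i in masked if conf[i] >= tau]
        if not chosen:
            # Assumed fallback: commit the single best token so that
            # every round makes progress.
            chosen = [max(masked, key=lambda i: conf[i])]
        for i in chosen:
            revealed[i] = True
        rounds += 1
    return rounds


if __name__ == "__main__":
    # Invented confidences: six "easy" tokens and two "hard" ones.
    base = [0.95, 0.97, 0.92, 0.99, 0.65, 0.96, 0.50, 0.93]
    print("fixed-number rounds:", decode_fixed_number(base, num_rounds=4))
    print("static-threshold rounds:", decode_static_threshold(base, tau=0.9))
```

With these invented numbers, the threshold scheme commits the six easy tokens in the first round and spends only two more rounds on the hard ones (3 rounds in total), while the fixed-number scheme takes all 4 of its scheduled rounds no matter how confident the model already is.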