This paper introduces NAR-MBR, a non-autoregressive (NAR) decoding framework that applies minimum Bayes' risk (MBR) decoding to improve speech recognition accuracy while keeping the speed advantage of NAR methods. Rather than maximizing the output probability, it maximizes the expected utility computed from multiple samples drawn from the output probability of an NAR model, which mitigates the uncertainty that NAR decoding cannot resolve by conditioning on previously generated tokens. Experiments on several datasets, including LibriSpeech and Switchboard, show that NAR-MBR outperforms previous NAR decoding and runs faster than autoregressive decoding.
NAR-MBR decoding improves speech recognition accuracy over previous NAR decoding while running faster than autoregressive decoding.
Non-autoregressive (NAR) decoding generates output tokens in parallel, making speech recognition faster than autoregressive (AR) decoding, which generates them sequentially from left to right. However, recognition performance is degraded because NAR decoding cannot resolve uncertainty by conditioning on previously generated tokens. To address this issue, we propose a novel NAR decoding framework based on minimum Bayes' risk (MBR) decoding, termed NAR-MBR decoding, that maximizes the expected utility calculated from samples drawn from the output probability of an NAR model rather than maximizing the output probability itself. Notably, by leveraging the nature of NAR models, multiple samples are obtained efficiently with a single forward computation. Our experiments across LibriSpeech, Switchboard, AMI, and a web presentation corpus demonstrated that our NAR-MBR decoding outperformed previous NAR decoding and ran faster than AR decoding.
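The idea of sampling from a single NAR output and then maximizing expected utility can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`edit_distance`, `sample_hypotheses`, `mbr_select`) are hypothetical, the utility is taken to be negative edit distance, the candidate set is the sample set itself, and the per-position probabilities are toy values standing in for one forward pass of an NAR model. Details that a real speech recognizer would need, such as collapsing blanks or repeated tokens, are omitted.

```python
import random


def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def sample_hypotheses(position_probs, vocab, num_samples, rng):
    """Draw sequences by sampling every position independently.

    position_probs stands in for the output of one NAR forward pass:
    one probability vector over vocab per output position (assumed shape).
    """
    return [
        tuple(rng.choices(vocab, weights=probs, k=1)[0] for probs in position_probs)
        for _ in range(num_samples)
    ]


def mbr_select(samples):
    """Return the sample with the highest average utility against all samples.

    Utility is assumed to be negative edit distance (an illustrative choice).
    """
    def expected_utility(hyp):
        return sum(-edit_distance(hyp, ref) for ref in samples) / len(samples)

    return max(samples, key=expected_utility)


if __name__ == "__main__":
    vocab = ["a", "b", "c"]
    # Toy per-position probabilities; the middle position is uncertain.
    position_probs = [
        [0.9, 0.05, 0.05],
        [0.4, 0.35, 0.25],
        [0.1, 0.1, 0.8],
    ]
    rng = random.Random(0)
    samples = sample_hypotheses(position_probs, vocab, num_samples=16, rng=rng)
    print("MBR output:", mbr_select(samples))
```

Because all samples come from the same set of per-position distributions, drawing more of them does not require another model call in this sketch; only the pairwise utility computation grows, quadratically in the number of samples.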