The paper introduces Dual-Decoder (Dude), a novel architecture for accelerating natural language generation inference. Dude employs a two-decoder structure: a semantic decoder to capture long-term dependencies and an output decoder for fast sequence prediction. Experiments across NMT, text summarization, and question generation demonstrate a 1.43x to 1.62x speedup compared to standard autoregressive baselines, while maintaining comparable performance.
Goodbye, autoregressive bottlenecks: a dual-decoder architecture achieves up to 1.62× faster inference without sacrificing quality.
Natural language generation is an important task in natural language processing and has been applied in many scenarios. Most state-of-the-art generation models, however, are slow at inference time, mainly due to the sequential dependencies of autoregressive decoding and the ever-growing size of decoder models. To this end, we propose a Dual-Decoder (Dude) model that speeds up decoding without sacrificing overall model performance. The Dude model consists of a semantic decoder, which captures long-term semantic dependencies, and an output decoder, which predicts the target sequence quickly. We evaluate the Dude model on three natural language generation tasks: Neural Machine Translation, Text Summarization, and Question Generation. The experimental results demonstrate that our model achieves 1.43× faster inference than the standard baseline while maintaining comparable performance, and up to 1.62× faster on longer sequence-generation tasks.
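To make the division of labor concrete, here is a minimal schematic sketch in NumPy of a two-stage decoder. It is not the paper's actual architecture: the toy dimensions, the single-attention-pass "semantic decoder", and the per-position "output decoder" projection are all illustrative assumptions; the real Dude model's layer structure is not specified in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; not taken from the paper).
vocab, d_model, src_len, tgt_len = 50, 16, 8, 6

def semantic_decoder(enc_states, queries):
    """Hypothetical semantic decoder: one attention pass over encoder
    states that produces a semantic state per target position in
    parallel, standing in for 'capturing long-term dependencies'."""
    scores = queries @ enc_states.T / np.sqrt(d_model)        # (tgt, src)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax
    return weights @ enc_states                               # (tgt, d_model)

def output_decoder(sem_states, proj):
    """Hypothetical output decoder: a cheap per-position projection to
    token logits, avoiding an expensive sequential loop at this stage."""
    logits = sem_states @ proj                                # (tgt, vocab)
    return logits.argmax(axis=-1)                             # greedy token ids

enc_states = rng.normal(size=(src_len, d_model))   # encoder outputs
queries = rng.normal(size=(tgt_len, d_model))      # e.g. positional queries
proj = rng.normal(size=(d_model, vocab))           # output projection

tokens = output_decoder(semantic_decoder(enc_states, queries), proj)
print(tokens.shape)  # one token id per target position
```

The point of the sketch is the split itself: the heavier semantic stage runs once over the whole target length, while the output stage is a lightweight map from semantic states to tokens, which is where the claimed inference speedup would come from.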