May 28, 2026 · arXiv:2605.29707

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Jianuo Huang, Yaojie Zhang, Qituan Zhang, Haobin Lin, Hanlin Xu, Linfeng Zhang

AI Summary

This paper introduces Domino, a speculative decoding framework that separates causal dependency modeling from costly autoregressive draft execution in large language model (LLM) inference. A parallel draft backbone produces preliminary draft distributions for the whole block, and a lightweight Domino head then refines them with prefix-dependent causal information, aiming to cut drafting cost without sacrificing draft quality. On Qwen3 models, Domino achieves up to 5.49× end-to-end speedup under the Transformers backend and up to 5.8× throughput speedup under SGLang serving.
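The two-stage drafting idea can be illustrated with a toy sketch. This is not the paper's architecture: the function name `refine_draft`, the additive `correction` table keyed by the previous token, and greedy token selection are all assumptions made for illustration. It only shows the shape of the computation, where one parallel pass yields per-position logits and a cheap sequential step then adjusts each position using the token drafted before it.

```python
from typing import Dict, List


def argmax(xs: List[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


def refine_draft(
    base_logits: List[List[float]],
    prev_token: int,
    correction: Dict[int, List[float]],
) -> List[int]:
    """Toy stand-in for 'parallel backbone + lightweight causal head'.

    base_logits: one logit vector per block position, assumed to come
        from a single parallel pass of a draft backbone.
    prev_token:  last token of the already accepted prefix.
    correction:  hypothetical table mapping the previous token to an
        additive bias over the vocabulary (an assumption, standing in
        for whatever the real head computes).
    """
    draft: List[int] = []
    for logits in base_logits:
        bias = correction.get(prev_token, [0.0] * len(logits))
        refined = [l + b for l, b in zip(logits, bias)]
        prev_token = argmax(refined)  # greedy pick, then condition on it
        draft.append(prev_token)
    return draft


if __name__ == "__main__":
    base = [[0.0, 1.0, 0.5], [1.0, 0.9, 0.0]]
    print(refine_draft(base, prev_token=0, correction={}))
    print(refine_draft(base, prev_token=0, correction={1: [0.0, 0.5, 0.0]}))
```

In this toy, the expensive per-position logits are computed once, and only the small correction step runs sequentially. With an empty `correction` table the result reduces to a purely parallel draft.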

Key Contribution

Domino decouples causal dependency modeling from autoregressive draft execution, achieving up to 5.49× end-to-end speedup in LLM inference on Qwen3 models.

Abstract

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
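The draft-then-verify step that the abstract builds on can be sketched as follows. This is a minimal illustration of the standard greedy verification rule, assuming greedy decoding; the paper's exact verification setting is not stated here and may differ. The function name `verify_greedy` and its inputs are hypothetical.

```python
from typing import List


def verify_greedy(draft: List[int], target_argmax: List[int]) -> List[int]:
    """Accept the longest draft prefix the target model agrees with.

    draft:         tokens proposed by the drafter for one block.
    target_argmax: the target model's greedy token at each position,
        obtained from ONE parallel forward pass over prefix + draft.
        target_argmax[i] is the target's choice given prefix + draft[:i],
        so it has len(draft) + 1 entries.
    Returns the accepted draft tokens plus one token from the target:
    the correction at the first mismatch, or a bonus token if the
    whole draft was accepted.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("target_argmax must have len(draft) + 1 entries")

    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_argmax[i]:
            break  # first disagreement: discard the rest of the draft
        accepted.append(tok)

    # The target's own token at the next position is always valid.
    accepted.append(target_argmax[len(accepted)])
    return accepted


if __name__ == "__main__":
    print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))
    print(verify_greedy([5, 7], [5, 7, 3]))
```

Every verification round yields at least one token, so the speedup depends on how many draft tokens are accepted per round relative to the cost of producing the draft, which is the trade-off between draft quality and drafting cost described above.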

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0
Influential citations: 0
References: 24
Year: 2026
Venue: N/A
