This paper introduces Diffusion Draft Tree (DDTree), a speculative decoding method that builds a tree of draft continuations from the per-position output distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a best-first heap algorithm to select the continuations most likely to match the target model, as estimated from the draft model's output, and verifies all of them in a single forward pass of the target model. Because it builds on the DFlash drafter, the authors position DDTree among the leading approaches to speculative decoding.
DDTree extends block diffusion drafting with tree-based verification: instead of checking a single drafted trajectory per round, the target model checks several likely continuations in parallel.
Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, the gains from tree verification place DDTree among the leading approaches to speculative decoding.
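The two mechanisms named in the abstract, best-first selection under a node budget and ancestor-only masking, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the surrogate score of a path is the product of the drafter's per-position probabilities, which the abstract does not spell out, and the names `build_draft_tree`, `ancestor_mask`, `position_probs`, and `node_budget` are hypothetical.

```python
import heapq
import math


def build_draft_tree(position_probs, node_budget):
    """Select up to node_budget tree nodes in best-first order.

    position_probs[d] is a list of (token, prob) pairs for draft position d,
    sorted by descending prob, with every prob > 0 (assumption).
    Assumed surrogate: a path's score is the sum of its per-position log-probs.
    Returns a list of dicts; parent == -1 means the node hangs off the root.
    """
    nodes = []
    if node_budget <= 0 or not position_probs or not position_probs[0]:
        return nodes
    tie = 0  # unique tie-breaker so heap tuples never compare further fields
    first_prob = position_probs[0][0][1]
    # Heap entry: (-logp, tie, parent_index, depth, rank, parent_logp)
    heap = [(-math.log(first_prob), tie, -1, 0, 0, 0.0)]
    while heap and len(nodes) < node_budget:
        neg_logp, _, parent, depth, rank, parent_logp = heapq.heappop(heap)
        token = position_probs[depth][rank][0]
        nodes.append(
            {"token": token, "parent": parent, "depth": depth, "logp": -neg_logp}
        )
        index = len(nodes) - 1
        # Candidate 1: the best child of the node just selected.
        if depth + 1 < len(position_probs) and position_probs[depth + 1]:
            tie += 1
            child_prob = position_probs[depth + 1][0][1]
            heapq.heappush(
                heap,
                (neg_logp - math.log(child_prob), tie, index, depth + 1, 0, -neg_logp),
            )
        # Candidate 2: the next-best sibling under the same parent.
        if rank + 1 < len(position_probs[depth]):
            tie += 1
            sibling_prob = position_probs[depth][rank + 1][1]
            heapq.heappush(
                heap,
                (-(parent_logp + math.log(sibling_prob)), tie, parent, depth, rank + 1, parent_logp),
            )
    return nodes


def ancestor_mask(nodes):
    """mask[i][j] is True iff node j is node i itself or one of its ancestors."""
    size = len(nodes)
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        j = i
        while j != -1:
            mask[i][j] = True
            j = nodes[j]["parent"]
    return mask


if __name__ == "__main__":
    probs = [[("a", 0.6), ("b", 0.4)], [("c", 0.7), ("d", 0.3)]]
    tree = build_draft_tree(probs, node_budget=4)
    for node, row in zip(tree, ancestor_mask(tree)):
        print(node["token"], node["parent"], row)
```

Because a child or sibling never scores higher than the node that introduced it, nodes leave the heap in non-increasing score order and every parent is selected before its children. In a real verifier each tree node would also attend to the already accepted prefix, which this sketch leaves out.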