Westlake, ZJU | Jun 1, 2026 | arXiv:2606.01813

Cost-Aware Diffusion Draft Trees for Speculative Decoding

Shuai Zhang, Huachuan Qiu, Hongliang He, Yong Dai

AI Summary

This paper introduces CaDDTree, a speculative decoding method that optimizes token throughput by jointly selecting the draft tree structure and node budget, using explicit models of draft and verification latency. The authors observe that maximizing expected acceptance length gives no principled basis for budget selection: acceptance length is non-decreasing in the budget, so it always favors larger trees regardless of verification cost. In experiments on Qwen3-4B and Qwen3-8B across eight benchmarks, CaDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks.

Key Contribution

By jointly optimizing tree structure and node budget for throughput, CaDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks, without requiring an offline budget search.

Abstract

Speculative decoding accelerates inference by having a lightweight drafter propose tokens that the target language model verifies in parallel. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yielding per-position marginals; DDTree uses these to build a candidate tree that maximizes expected acceptance length under a fixed node budget. We observe, however, that acceptance length is non-decreasing in budget: it always favors larger trees regardless of verification cost, offering no principled basis for budget selection. We introduce CaDDTree (Cost-aware Diffusion Draft Tree), a method that directly optimizes token throughput (expected tokens generated per unit time) by jointly selecting the tree structure and node budget. We model draft and verification latencies explicitly, show that the throughput objective decomposes into a per-round one-dimensional search over the budget, and prove that under a convex verification cost the throughput function is unimodal, enabling an efficient greedy stopping rule. CaDDTree requires no offline budget search, adapting the budget each round from the current per-position distributions and verification cost. Experiments on Qwen3-4B and Qwen3-8B across eight benchmarks spanning reasoning, coding, and instruction-following tasks show that CaDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks.
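
The greedy stopping rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, and every modeling choice in it is an assumption: a node's acceptance probability is taken to be the product of the per-position marginals along its path, expected tokens per round is taken to be one plus the sum of those probabilities, and throughput is taken to be that quantity divided by draft latency plus verification latency. The names `cost_aware_tree`, `marginals`, `draft_latency`, `verify_latency`, and `max_nodes` are hypothetical. Nodes are added best-first, and growth stops at the first node that fails to raise the estimated throughput, which is where the maximum lies if the throughput curve is unimodal.

```python
import heapq
from typing import Callable, List, Tuple

Node = Tuple[int, ...]  # one token rank per depth along the path


def cost_aware_tree(
    marginals: List[List[float]],
    draft_latency: float,
    verify_latency: Callable[[int], float],
    max_nodes: int = 64,
) -> Tuple[List[Node], float]:
    """Grow a draft tree best-first; stop when estimated throughput drops.

    marginals[d] holds the drafter's token probabilities at depth d,
    sorted in descending order (assumed input format).
    verify_latency(n) is the assumed cost of verifying n draft nodes and
    must be positive for n = 0 (the target still runs one forward pass).
    """
    nodes: List[Node] = []
    expected_len = 1.0  # assumption: the target always emits one token
    best = expected_len / (draft_latency + verify_latency(0))
    if not marginals or not marginals[0]:
        return nodes, best

    # Max-heap via negated probability: (-path_prob, path, parent_prob).
    heap = [(-marginals[0][0], (0,), 1.0)]
    while heap and len(nodes) < max_nodes:
        neg_p, path, parent_p = heap[0]
        p = -neg_p
        cost = draft_latency + verify_latency(len(nodes) + 1)
        candidate = (expected_len + p) / cost
        if candidate <= best:
            break  # greedy stop: the next-best node no longer pays off
        heapq.heappop(heap)
        nodes.append(path)
        expected_len += p
        best = candidate

        depth, rank = len(path) - 1, path[-1]
        # Next sibling: same parent, next-most-likely token at this depth.
        if rank + 1 < len(marginals[depth]):
            sibling_p = parent_p * marginals[depth][rank + 1]
            heapq.heappush(heap, (-sibling_p, path[:-1] + (rank + 1,), parent_p))
        # First child: most likely token at the next depth.
        if depth + 1 < len(marginals) and marginals[depth + 1]:
            child_p = p * marginals[depth + 1][0]
            heapq.heappush(heap, (-child_p, path + (0,), p))
    return nodes, best
```

A call such as `cost_aware_tree([[0.8, 0.15], [0.6, 0.3]], 0.2, lambda n: 1.0 + 0.05 * n)` returns the selected paths together with the estimated throughput. Because each candidate is generated only from its parent or its previous sibling, the selected set is always a valid prefix-closed tree, and the budget falls out of the stopping rule rather than being fixed in advance.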

Inference & Quantization, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
