This paper introduces AdaPLD, a training-free method that enhances speculative decoding by adaptively improving both retrieval and draft construction. It addresses two limitations of existing reuse-based methods (the limited recall of lexical retrieval and the brittleness of deterministic span copying) by preserving high-precision lexical reuse while using semantic similarity to expand reuse opportunities. The results show that AdaPLD reduces target-model forward passes, achieving up to a 3.10x decoding speedup across diverse benchmarks.
AdaPLD achieves up to 3.10x faster decoding by combining lexical and semantic retrieval with branched draft hypotheses.
Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under surface-form variation, and deterministic span copying can be brittle when the retrieved context does not uniquely determine the continuation. We propose \emph{AdaPLD}, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD preserves high-precision lexical reuse while using semantic similarity to recover additional reuse opportunities when lexical matching fails. It further constructs branched reuse hypotheses to account for continuation uncertainty, rather than relying on a single copied span. Across diverse benchmarks, AdaPLD reduces target-model forward passes and achieves up to $3.10\times$ decoding speedup.
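To make the reuse-and-verify loop concrete, here is a minimal Python sketch of lexical (n-gram) draft retrieval with several branched candidates, followed by greedy verification. It is an illustration under stated assumptions, not the AdaPLD implementation: the function names (`lexical_drafts`, `verify`, `speculative_step`), the parameters (`ngram`, `span`, `max_branches`), and the `next_token` callback standing in for the target model are all hypothetical. The semantic-similarity fallback is omitted, and the branch selection rule (keep the longest accepted branch) is a simplification.

```python
from typing import Callable, List, Sequence


def lexical_drafts(context: Sequence[int], ngram: int = 2, span: int = 4,
                   max_branches: int = 2) -> List[List[int]]:
    """Copy candidate continuations from earlier occurrences of the last
    `ngram` tokens, most recent match first (hypothetical helper)."""
    if len(context) < ngram:
        return []
    key = tuple(context[-ngram:])
    drafts: List[List[int]] = []
    # Scan backwards over earlier positions, excluding the suffix itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if tuple(context[start:start + ngram]) == key:
            cont = list(context[start + ngram:start + ngram + span])
            if cont and cont not in drafts:
                drafts.append(cont)
                if len(drafts) == max_branches:
                    break
    return drafts


def verify(context: Sequence[int], draft: Sequence[int],
           next_token: Callable[[Sequence[int]], int]) -> List[int]:
    """Greedy verification: keep the longest draft prefix the target model
    agrees with, then append one token chosen by the target model.
    A real system scores all draft positions in one batched forward pass;
    the sequential calls here only simulate that."""
    accepted: List[int] = []
    for tok in draft:
        expected = next_token(list(context) + accepted)
        if expected != tok:
            accepted.append(expected)  # correction token from the target model
            return accepted
        accepted.append(tok)
    accepted.append(next_token(list(context) + accepted))  # bonus token
    return accepted


def speculative_step(context: Sequence[int],
                     next_token: Callable[[Sequence[int]], int],
                     **kwargs: int) -> List[int]:
    """Verify every branch and keep the one that yields the most tokens.
    With no lexical match this degrades to ordinary one-token decoding."""
    drafts = lexical_drafts(context, **kwargs) or [[]]
    return max((verify(context, d, next_token) for d in drafts), key=len)
```

In this toy setting, a context whose last n-gram occurred twice before with different continuations produces two branches. A single copied span would commit to the most recent one, while the branched variant can still accept tokens when the older continuation is the one the target model agrees with.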