Tsinghua AI | BIT | JDT AI Infra | Jun 4, 2026 | arXiv:2606.05742

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

Runheng Liu, Jincheng Xie, Wen Hu, Xingchen Xiao, Heyan Huang

AI Summary

This paper introduces AdaPLD, a training-free method that improves model-free speculative decoding by adapting both retrieval and draft construction. It targets two limitations of existing reuse-based methods: lexically anchored retrieval misses reuse opportunities under surface-form variation, and deterministic span copying is brittle when the retrieved context does not uniquely determine the continuation. AdaPLD keeps high-precision lexical reuse, uses semantic similarity to recover reuse opportunities when lexical matching fails, and constructs branched reuse hypotheses instead of a single copied span. Across diverse benchmarks, it reduces target-model forward passes and achieves up to a 3.10x decoding speedup.

Key Contribution

AdaPLD achieves up to 3.10x faster decoding by combining high-precision lexical retrieval with a semantic-similarity fallback, and by drafting branched reuse hypotheses rather than a single copied span.

Abstract

Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under surface-form variation, and deterministic span copying can be brittle when the retrieved context does not uniquely determine the continuation. We propose AdaPLD, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD preserves high-precision lexical reuse while using semantic similarity to recover additional reuse opportunities when lexical matching fails. It further constructs branched reuse hypotheses to account for continuation uncertainty, rather than relying on a single copied span. Across diverse benchmarks, AdaPLD reduces target-model forward passes and achieves up to 3.10x decoding speedup.
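To make the abstract's vocabulary concrete, the sketch below shows a generic lexical-reuse loop of the kind it describes: look up the last few tokens earlier in the context, copy what followed as a draft, and let the target model verify it. This is not the authors' implementation. The function names (`lookup_drafts`, `verify`, `decode`), the parameters (`ngram`, `span`, `max_branches`), and the toy target model are all assumptions made for illustration, and the semantic-similarity retrieval step is omitted entirely. Verification is simulated token by token here, whereas a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]  # greedy target model (assumed interface)


def lookup_drafts(tokens: Sequence[Token], ngram: int = 2, span: int = 4,
                  max_branches: int = 2) -> List[List[Token]]:
    """Lexical retrieval: find earlier occurrences of the last `ngram` tokens
    and return the spans that followed them, most recent match first.
    Returning several distinct spans gives a simple form of branching."""
    if len(tokens) < ngram:
        return []
    key = tuple(tokens[-ngram:])
    drafts: List[List[Token]] = []
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tuple(tokens[start:start + ngram]) == key:
            cont = list(tokens[start + ngram:start + ngram + span])
            if cont and cont not in drafts:
                drafts.append(cont)
                if len(drafts) == max_branches:
                    break
    return drafts


def verify(tokens: Sequence[Token], draft: Sequence[Token],
           next_token: NextToken) -> List[Token]:
    """Accept the longest draft prefix that matches the target's greedy
    choices, then append one token chosen by the target itself."""
    ctx = list(tokens)
    accepted: List[Token] = []
    for t in draft:
        expected = next_token(ctx)
        if t != expected:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(next_token(ctx))  # bonus token after a fully accepted draft
    return accepted


def decode(prompt: Sequence[Token], next_token: NextToken, max_new: int,
           ngram: int = 2, span: int = 4,
           max_branches: int = 2) -> Tuple[List[Token], int]:
    """Generate `max_new` tokens and count simulated verification passes."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new:
        drafts = lookup_drafts(tokens, ngram, span, max_branches) or [[]]
        # All branches are treated as checked in one pass; keep the best one.
        best = max((verify(tokens, d, next_token) for d in drafts), key=len)
        passes += 1
        tokens.extend(best)
    return tokens[:len(prompt) + max_new], passes


# Toy target: deterministically cycles through a fixed pattern.
PATTERN = [1, 2, 3, 4, 5]


def toy_target(ctx: Sequence[Token]) -> Token:
    return PATTERN[len(ctx) % len(PATTERN)]


out, passes = decode([1, 2, 3, 4, 5, 1, 2], toy_target, max_new=10)
print(passes)  # 2 verification passes, versus 10 for token-by-token decoding
```

On this perfectly repetitive toy input every draft is accepted, so the pass count drops from 10 to 2. On real text the acceptance rate depends on how often the retrieved span actually predicts the continuation, which is the reliability issue the abstract raises.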

Inference & Quantization | Natural Language Processing | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
