Calibrated Speculative Decoding (CSD) improves speculative decoding by rescuing valid but lexically divergent tokens that are typically rejected. CSD uses an Online Correction Memory to propose rescue candidates based on historical rejections and Semantic Consistency Gating to verify admissibility using probability ratios. Experiments across various LLMs show that CSD achieves up to 2.33x throughput speedup while preserving or improving accuracy, especially on complex reasoning tasks.
Rescuing previously rejected draft tokens that are semantically valid but lexically different speeds up speculative decoding by more than 2x without sacrificing accuracy.
Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of "Frequency-Guided Candidate Selection and Probability-Guarded Acceptance," CSD incorporates two lightweight modules: Online Correction Memory, which aggregates historical rejections to propose recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios instead of exact token matching. Our evaluation across diverse large language models demonstrates that CSD outperforms existing methods, achieving a peak throughput speedup of 2.33x. CSD preserves model accuracy across all tasks while further boosting performance on complex reasoning datasets. These results establish CSD as a highly effective, lightweight solution for practical LLM deployments.
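The abstract describes two modules without giving their exact formulation, so the following is a minimal sketch of how they might plug into a verification step. All names (`CorrectionMemory`, `semantic_gate`, `verify_with_rescue`), the ratio threshold, and the specific acceptance rule are assumptions for illustration, not the paper's actual implementation:

```python
from collections import Counter, defaultdict


class CorrectionMemory:
    """Hypothetical sketch of an Online Correction Memory: counts how often a
    rejected draft token was historically replaced by each target token, and
    proposes the most frequent replacements as rescue candidates."""

    def __init__(self, top_k=3):
        # rejected token -> Counter of observed replacement tokens
        self.counts = defaultdict(Counter)
        self.top_k = top_k

    def record(self, rejected, replacement):
        self.counts[rejected][replacement] += 1

    def candidates(self, rejected):
        # Frequency-guided candidate selection: most common replacements first.
        return [tok for tok, _ in self.counts[rejected].most_common(self.top_k)]


def semantic_gate(token, target_probs, ratio_threshold=0.5):
    """Assumed form of Semantic Consistency Gating: admit a token if its
    target-model probability is within a ratio of the argmax probability,
    rather than requiring an exact match with the argmax token."""
    p_top = max(target_probs.values())
    return target_probs.get(token, 0.0) >= ratio_threshold * p_top


def verify_with_rescue(draft_token, target_probs, memory, ratio_threshold=0.5):
    """One verification step with rescue: accept a matching draft token as in
    standard speculative decoding; otherwise try to rescue before rejecting."""
    top_token = max(target_probs, key=target_probs.get)
    if draft_token == top_token:
        return draft_token  # standard acceptance
    # Probability-guarded acceptance of the lexically divergent draft token.
    if semantic_gate(draft_token, target_probs, ratio_threshold):
        return draft_token
    # Otherwise consult the memory for historically frequent rescue candidates.
    for cand in memory.candidates(draft_token):
        if cand != top_token and semantic_gate(cand, target_probs, ratio_threshold):
            return cand
    # No rescue: learn this divergence pattern, fall back to the target's choice.
    memory.record(draft_token, top_token)
    return top_token
```

Under this sketch, a draft token like "colour" whose target probability is at least half that of the argmax "color" would be accepted rather than rejected, which is the kind of lexically divergent but semantically valid case the abstract targets.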