The paper introduces Quasar, a training-free framework that accelerates LLM inference by applying low-bit quantization specifically to the verification stage of speculative decoding. Quasar addresses the memory-bandwidth bottleneck in the verification phase, which limits the speedup achievable by self-speculation and lookahead decoding. By quantizing the verification process, Quasar maintains speculative acceptance length comparable to full-precision methods while achieving a 1.28x improvement in end-to-end throughput on models like OpenPangu and Qwen3.
Quantization can halve memory traffic during speculative decoding's verification stage, boosting end-to-end throughput by 28% without retraining.
Speculative Decoding (SD) has emerged as a premier technique for accelerating Large Language Model (LLM) inference by decoupling token generation into rapid drafting and parallel verification. While recent advancements in self-speculation and lookahead decoding have successfully minimized drafting overhead, they have shifted the primary performance bottleneck to the verification phase. Since verification requires a full forward pass of the target model, it remains strictly memory-bandwidth bound, fundamentally limiting the maximum achievable speedup. In this paper, we introduce Quasar (Quantized Self-speculative Acceleration for Rapid Inference), a novel, training-free framework designed to overcome this "memory wall" by employing low-bit quantization specifically for the verification stage. Our empirical analysis reveals that while aggressive structural pruning significantly degrades verification accuracy, quantization-based verification preserves the logit distribution with high fidelity while effectively halving memory traffic. Extensive experiments on state-of-the-art models (e.g., OpenPangu and Qwen3) demonstrate that Quasar maintains a speculative acceptance length comparable to full-precision methods while achieving a $1.28\times$ improvement in end-to-end throughput. Being orthogonal to existing drafting strategies, Quasar offers a generic and efficient pathway to accelerate the verification leg of speculative execution. Code is available at https://github.com/Tom-HG/Quasar.
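To make the idea of quantized verification concrete, here is a minimal toy sketch. It is not the Quasar implementation. It assumes greedy verification, a single output projection standing in for the target model, and symmetric per-row int8 quantization. The abstract says only "low-bit", so the bit width, the function names, and the toy weights are all illustrative assumptions. The sketch checks whether a verifier with quantized weights accepts the same number of draft tokens as the full-precision verifier.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def quantize_row_int8(row: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric per-row int8 quantization (assumed scheme, for illustration)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(w / scale) for w in row], scale


def dequantize_row(q: Sequence[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float weights."""
    return [v * scale for v in q]


def fake_quantize(weights: Matrix) -> Matrix:
    """Quantize then dequantize every row, mimicking low-bit weight storage."""
    return [dequantize_row(*quantize_row_int8(row)) for row in weights]


def verify_tokens(weights: Matrix, hidden_states: Matrix) -> List[int]:
    """Greedy verifier: the argmax token of the logits at each position."""
    tokens = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in weights]
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


def acceptance_length(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Number of leading draft tokens that the verifier agrees with."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n


# Toy setup (hypothetical values): vocabulary of 3 tokens, hidden size 2.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
H = [[2.0, 0.5], [0.1, 3.0], [-1.0, -2.0]]
draft_tokens = [0, 1, 0]  # the third draft token is wrong on purpose

full_precision_len = acceptance_length(draft_tokens, verify_tokens(W, H))
quantized_len = acceptance_length(draft_tokens, verify_tokens(fake_quantize(W), H))
```

In this toy case both verifiers accept the first two draft tokens and reject the third, so the acceptance length is unchanged by quantization. In a real system the gain would come from reading fewer weight bytes per verification pass. This sketch models only the accuracy side of the trade-off, not the memory traffic.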