Mar 2, 2026 · arXiv:2603.01399

Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification

AI Summary

The paper introduces Quasar, a training-free framework that accelerates LLM inference by applying low-bit quantization specifically to the verification stage of speculative decoding. Quasar addresses the memory-bandwidth bottleneck in the verification phase, which limits the speedup achievable by self-speculation and lookahead decoding. By quantizing the verification process, Quasar maintains speculative acceptance length comparable to full-precision methods while achieving a 1.28x improvement in end-to-end throughput on models like OpenPangu and Qwen3.

Key Contribution

Quantization can halve memory traffic during speculative decoding's verification stage, boosting end-to-end throughput by 28% without retraining.
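The relation between "halving memory traffic" and a 28% end-to-end gain can be illustrated with a back-of-the-envelope model. The sketch below is not from the paper: it assumes verification time scales linearly with bytes moved, that the rest of each decoding step is unchanged, and that acceptance length stays the same. The function names `weight_bytes` and `step_speedup` and the parameter `verify_fraction` are hypothetical.

```python
def weight_bytes(n_params: float, bits: int) -> float:
    """Bytes needed to stream all weights once at the given bit width."""
    return n_params * bits / 8


def step_speedup(verify_fraction: float, traffic_ratio: float) -> float:
    """Amdahl-style estimate of per-step speedup.

    verify_fraction: share of step time spent in verification (assumed, not measured).
    traffic_ratio:   new memory traffic divided by old traffic (0.5 means halved).
    """
    if not 0.0 <= verify_fraction <= 1.0:
        raise ValueError("verify_fraction must lie in [0, 1]")
    if traffic_ratio <= 0.0:
        raise ValueError("traffic_ratio must be positive")
    # Only the verification share shrinks; everything else keeps its cost.
    return 1.0 / ((1.0 - verify_fraction) + verify_fraction * traffic_ratio)


if __name__ == "__main__":
    # Hypothetical 8B-parameter model: 16-bit vs 8-bit weights.
    ratio = weight_bytes(8e9, 8) / weight_bytes(8e9, 16)
    for f in (0.25, 0.5, 1.0):
        print(f"verify_fraction={f:.2f} -> speedup {step_speedup(f, ratio):.3f}x")
```

Under this toy model, halved traffic yields a full 2x only if verification accounts for the entire step, so a reported end-to-end gain below 2x is what one would expect when drafting and other overheads also take time.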

Abstract

Speculative Decoding (SD) has emerged as a premier technique for accelerating Large Language Model (LLM) inference by decoupling token generation into rapid drafting and parallel verification. While recent advancements in self-speculation and lookahead decoding have successfully minimized drafting overhead, they have shifted the primary performance bottleneck to the verification phase. Since verification requires a full forward pass of the target model, it remains strictly memory-bandwidth bound, fundamentally limiting the maximum achievable speedup. In this paper, we introduce Quasar (Quantized Self-speculative Acceleration for Rapid Inference), a novel, training-free framework designed to overcome this "memory wall" by employing low-bit quantization specifically for the verification stage. Our empirical analysis reveals that while aggressive structural pruning significantly degrades verification accuracy, quantization-based verification preserves the logit distribution with high fidelity while effectively halving memory traffic. Extensive experiments on state-of-the-art models (e.g., OpenPangu and Qwen3) demonstrate that Quasar maintains a speculative acceptance length comparable to full-precision methods while achieving a 1.28x improvement in end-to-end throughput. Being orthogonal to existing drafting strategies, Quasar offers a generic and efficient pathway to accelerate the verification leg of speculative execution. Code is available at https://github.com/Tom-HG/Quasar.
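The draft-then-verify idea in the abstract can be sketched with a toy example. This is a minimal illustration, not the Quasar implementation: the table-lookup "model", the symmetric 8-bit round-to-nearest scheme in `fake_quantize`, and the greedy acceptance rule in `verify_greedy` are all assumptions chosen for brevity. A real verifier scores every draft position in one parallel forward pass; the loop here is sequential only for readability.

```python
from typing import Callable, List, Sequence

# A "model" maps a token prefix to next-token logits (hypothetical interface).
Model = Callable[[Sequence[int]], List[float]]


def argmax(xs: Sequence[float]) -> int:
    return max(range(len(xs)), key=lambda i: xs[i])


def fake_quantize(row: Sequence[float], bits: int = 8) -> List[float]:
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(w) for w in row)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(w / scale) * scale for w in row]


def make_model(table: List[List[float]]) -> Model:
    """Toy model: logits depend only on the last token of the prefix."""
    return lambda prefix: table[prefix[-1]]


def verify_greedy(verifier: Model, prefix: Sequence[int], draft: Sequence[int]) -> List[int]:
    """Accept draft tokens while they match the verifier's greedy choice.

    Returns the accepted tokens plus one token from the verifier: a correction
    at the first mismatch, or a bonus token if the whole draft is accepted.
    """
    ctx, out = list(prefix), []
    for tok in draft:
        target = argmax(verifier(ctx))
        if target != tok:
            out.append(target)  # replace the rejected draft token
            return out
        out.append(tok)
        ctx.append(tok)
    out.append(argmax(verifier(ctx)))  # bonus token after full acceptance
    return out


# Toy weights: token i strongly prefers token (i + 1) % V.
V = 4
TABLE = [[1.0 if j == (i + 1) % V else 0.1 * j for j in range(V)] for i in range(V)]
full = make_model(TABLE)
quant = make_model([fake_quantize(row) for row in TABLE])

if __name__ == "__main__":
    draft = [1, 2, 0]  # last draft token is wrong on purpose
    print(verify_greedy(full, [0], draft))   # [1, 2, 3]
    print(verify_greedy(quant, [0], draft))  # [1, 2, 3]
```

In this toy setting the quantized verifier accepts exactly the same tokens as the full-precision one because rounding does not change which logit is largest. That mirrors, in miniature, the abstract's claim that quantized verification preserves the logit distribution well enough to keep acceptance length comparable.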

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
