The paper introduces QuantSpec, a self-speculative decoding framework designed to accelerate long-context LLM inference on edge devices by addressing the KV cache bottleneck. QuantSpec employs a draft model that shares the target model's architecture but utilizes a hierarchical 4-bit quantized KV cache and 4-bit quantized weights. The method achieves high acceptance rates (over 90%) and provides consistent end-to-end speedups of up to 2.5x, while also reducing memory requirements by approximately 1.3x compared to other self-speculative decoding methods using sparse KV caches.
Forget sparse KV caches – QuantSpec's hierarchical 4-bit quantization unlocks 2.5x speedups in long-context LLM inference with >90% acceptance rates.
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups in this setting because their KV cache optimization strategies are inefficient and lead to low acceptance rates. To address these challenges, we propose QuantSpec, a novel self-speculative decoding framework in which the draft model shares the target model's architecture but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates ($>$90%) and consistently provides end-to-end speedups of up to $\sim 2.5\times$, outperforming other self-speculative decoding methods that use sparse KV caches for long-context LLM inference. QuantSpec also reduces memory requirements by $\sim 1.3\times$ compared to these alternatives.
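As a rough illustration of the two ideas in the abstract, the sketch below pairs per-group 4-bit quantization of a KV-cache tensor with a greedy draft-then-verify loop in which the draft and target share the same network and differ only in which cache they read. This is not the QuantSpec implementation: the function names (`quantize_4bit`, `speculate_greedy`), the group size, and the toy lookup-table "model" at the bottom are all assumptions made purely for illustration.

```python
import torch


def quantize_4bit(x: torch.Tensor, group_size: int = 64):
    """Asymmetric per-group 4-bit quantization (16 levels per group).

    Illustrative only; assumes x.numel() is divisible by group_size."""
    g = x.reshape(-1, group_size)
    lo = g.min(dim=-1, keepdim=True).values
    scale = (g.max(dim=-1, keepdim=True).values - lo).clamp_min(1e-8) / 15.0
    q = ((g - lo) / scale).round_().clamp_(0, 15).to(torch.uint8)
    return q, scale, lo, x.shape


def dequantize_4bit(q, scale, lo, shape):
    """Inverse of quantize_4bit, up to quantization error."""
    return (q.float() * scale + lo).reshape(shape)


@torch.no_grad()
def speculate_greedy(draft_logits, target_logits, prefix: torch.Tensor, gamma: int = 4):
    """One round of greedy draft-then-verify speculative decoding.

    draft_logits(ids) and target_logits(ids) map token ids of shape (1, T)
    to logits of shape (1, T, vocab). In a self-speculative setup like the
    one described above, both would be the same network, with the draft pass
    reading a 4-bit KV cache instead of the full-precision one."""
    # 1) Draft gamma tokens autoregressively with the cheap (quantized) pass.
    ids = prefix
    for _ in range(gamma):
        nxt = draft_logits(ids)[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], dim=-1)
    draft = ids[:, prefix.size(1):]

    # 2) Verify all drafted tokens in a single full-precision pass.
    verified = target_logits(ids)[:, -gamma - 1:-1].argmax(-1)

    # 3) Accept the longest prefix on which draft and target agree, then
    #    append the target's own token at the first disagreement (if any).
    n_accept = int((verified == draft).long().cumprod(-1).sum())
    accepted = draft[:, :n_accept]
    bonus = verified[:, n_accept:n_accept + 1]
    return torch.cat([prefix, accepted, bonus], dim=-1)


if __name__ == "__main__":
    torch.manual_seed(0)

    # Round-trip a fake KV tensor through 4-bit quantization.
    kv = torch.randn(2, 8, 64)
    approx = dequantize_4bit(*quantize_4bit(kv))
    print("max KV reconstruction error:", (kv - approx).abs().max().item())

    # Toy "model": a fixed lookup table, so the sketch runs end to end.
    table = torch.randn(32, 32)

    def logits_fn(ids):
        return table[ids]

    print(speculate_greedy(logits_fn, logits_fn, torch.tensor([[1, 2, 3]])))
```

Because the draft pass reuses the target's own weights (here, the same `logits_fn`), no separate draft model needs to be trained or stored; in the full system the draft would additionally read the hierarchical 4-bit KV cache and 4-bit weights described in the abstract, which is where the memory and latency savings come from.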