The paper introduces QuickSilver, a modular inference framework for LLMs that optimizes runtime without retraining or architectural changes. QuickSilver combines dynamic token halting, KV cache skipping, and contextual token fusion to reduce FLOPs during autoregressive decoding. Experiments on GPT-2 and Llama-2 demonstrate up to 39.6% FLOP reduction with minimal perplexity degradation on WikiText-103 and C4 datasets.
Achieve up to 39.6% FLOP reduction in LLM inference without retraining or architectural changes using QuickSilver's dynamic token-level optimizations.
Inference accounts for the majority of latency and energy consumption in large language model (LLM) deployments, often exceeding 90% of total cost. While training-time efficiency has seen extensive progress, runtime optimization remains a key bottleneck, particularly under autoregressive decoding. Existing approaches -- such as pruning, quantization, early exits, and speculative decoding -- often require retraining or architectural changes, or break decoding compatibility. We introduce QuickSilver, a modular, token-level framework that enables semantic adaptivity at inference time without altering model weights or structure. QuickSilver integrates three synergistic mechanisms: (i) Dynamic Token Halting, which stops computation for tokens whose representations have converged; (ii) KV Cache Skipping, which selectively suppresses memory writes to reduce attention overhead; and (iii) Contextual Token Fusion, which collapses redundant tokens into shared computation paths to shrink the effective sequence length. Unlike speculative decoding or MoE routing, QuickSilver operates entirely on frozen, dense models and requires no auxiliary networks. Applied to GPT-2 and Llama-2 across WikiText-103 and C4, QuickSilver achieves up to 39.6% FLOP reduction with negligible perplexity degradation (<= 0.2).
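To make the halting idea concrete, here is a minimal NumPy sketch of dynamic token halting. This is an illustration under simplified assumptions, not the paper's implementation: `layers` stands in for a stack of transformer layers, and a token is frozen once its hidden state changes by less than a threshold `eps` between consecutive layers, so it skips all remaining layers.

```python
import numpy as np

def forward_with_token_halting(layers, hidden, eps=1e-3):
    """Illustrative token-halting forward pass (hypothetical, not the paper's code).

    layers: list of callables mapping a (n_tokens x d_model) array to an
            updated array of the same shape.
    hidden: (seq_len x d_model) initial token representations.
    A token halts when its representation moves less than `eps` between
    consecutive layers; halted tokens skip the remaining layers.
    """
    seq_len = hidden.shape[0]
    active = np.ones(seq_len, dtype=bool)  # tokens still being computed
    for layer in layers:
        if not active.any():
            break  # every token has converged; skip the rest of the stack
        updated = hidden.copy()
        updated[active] = layer(hidden[active])  # compute only active tokens
        # Per-token change between consecutive layers.
        delta = np.linalg.norm(updated[active] - hidden[active], axis=-1)
        idx = np.flatnonzero(active)
        active[idx[delta < eps]] = False  # freeze converged tokens
        hidden = updated
    return hidden, active
```

Because halted tokens are excluded from the per-layer matrix multiply, the FLOP count shrinks as more tokens converge; a real implementation would also have to reconcile this with attention, which is what the KV Cache Skipping mechanism addresses.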