This paper introduces SparKV, an adaptive KV cache loading framework for on-device LLM inference that strategically combines cloud-based KV streaming with local computation. SparKV uses a cost model to determine whether individual KV chunks should be streamed or computed locally, overlapping these processes to minimize latency. By dynamically adjusting offline-generated schedules at runtime to account for network and resource variability, SparKV achieves significant reductions in Time-to-First-Token (1.3x-5.1x) and energy consumption (1.5x-3.3x) with minimal impact on output quality.
On-device LLM inference gets a massive speed and energy boost by adaptively streaming only the most expensive parts of the KV cache from the cloud.
Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We present SparKV, an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. SparKV models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV further refines offline-generated schedules at runtime to rebalance communication and computation costs. Experiments across diverse datasets, LLMs, and edge devices show that SparKV reduces Time-to-First-Token by 1.3x-5.1x with negligible impact on response quality, while lowering per-request energy consumption by 1.5x-3.3x, demonstrating its robustness and practicality for real-world on-device deployment.
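The core mechanism described above, a per-chunk cost model that routes each KV chunk to the cheaper of two overlapped paths (cloud streaming vs. local prefill), can be illustrated with a minimal sketch. This is not the authors' implementation: the cost formulas, constants, and function names here are illustrative assumptions.

```python
# Illustrative sketch of a SparKV-style per-chunk scheduler (assumptions,
# not the paper's actual cost model). Each KV chunk is assigned to the
# streaming path or the local-compute path; because the two paths run
# concurrently, end-to-end prefill latency is the max of the two.

from dataclasses import dataclass

@dataclass
class Chunk:
    tokens: int      # number of tokens covered by this KV chunk
    kv_bytes: int    # serialized size of the chunk's KV tensors

def stream_cost(c: Chunk, bandwidth_bps: float, rtt_s: float) -> float:
    """Estimated time to download the chunk's KV cache from the cloud."""
    return rtt_s + c.kv_bytes * 8 / bandwidth_bps

def compute_cost(c: Chunk, device_tok_per_s: float) -> float:
    """Estimated time to recompute the chunk's KV on device (prefill)."""
    return c.tokens / device_tok_per_s

def schedule(chunks, bandwidth_bps, rtt_s, device_tok_per_s):
    """Greedily assign each chunk to whichever path finishes earlier,
    given the work already queued on it; return the plan and the
    overlapped latency max(network time, compute time)."""
    plan, t_net, t_cpu = [], 0.0, 0.0
    for c in chunks:
        s = stream_cost(c, bandwidth_bps, rtt_s)
        l = compute_cost(c, device_tok_per_s)
        if t_net + s <= t_cpu + l:
            plan.append("stream")
            t_net += s
        else:
            plan.append("compute")
            t_cpu += l
    return plan, max(t_net, t_cpu)
```

On a slow link (e.g. 1 Mbps), large chunks are cheaper to recompute locally; on a fast link, streaming wins, which mirrors the runtime rebalancing the abstract describes when connectivity fluctuates.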