Tsinghua AICASHKUPKUShenzhen University of Advanced
Jun 9, 2026
arXiv:2606.11164

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei, Xiangyuan Wang, Mengzhe Ruan, Hanxu Hou, Peisong Wang, Linqi Song, Shuang Qiu

AI Summary

This paper introduces ReasonAlloc, a training-free framework that addresses the inference bottleneck caused by rapid key-value (KV) cache growth during long chain-of-thought reasoning in large language models. It uses a hierarchical budget allocation strategy that combines offline layer-wise preallocation with online head-wise reallocation, so the cache budget follows the stepwise context demands of autoregressive decoding. On MATH-500 and AIME 2024, ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV, with the largest gains at small budgets (128-512 tokens), and it adds negligible inference-time overhead.

Key Contribution

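ReasonAlloc combines an offline layer-wise KV cache budget preallocation with online head-wise reallocation during decoding, outperforming uniform-budget baselines (most clearly at budgets of 128-512 tokens) with negligible inference-time overhead.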

Abstract

Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern which we call "Reasoning Wave", while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME 2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128-512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.
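
The two-level idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the layer profile or the head utility is computed, nor the exact allocation rule, so everything below is an assumption made for illustration: the function names `proportional_split` and `allocate_budgets`, the proportional (largest-remainder) rule, the `min_per_head` floor, and the example weights are all hypothetical and are not taken from the paper.

```python
from typing import List


def proportional_split(total: int, weights: List[float], floor: int = 0) -> List[int]:
    """Split `total` integer slots across len(weights) bins.

    Each bin first receives `floor` slots; the remainder is divided in
    proportion to `weights` using the largest-remainder method, so the
    result always sums exactly to `total`.
    """
    n = len(weights)
    if total < floor * n:
        raise ValueError("total is too small to give every bin its floor")
    spare = total - floor * n
    weight_sum = sum(weights)
    if weight_sum <= 0:
        shares = [spare / n] * n  # no signal: fall back to a uniform split
    else:
        shares = [spare * w / weight_sum for w in weights]
    base = [int(s) for s in shares]
    leftover = spare - sum(base)
    # Hand the remaining slots to the bins with the largest fractional parts.
    order = sorted(range(n), key=lambda i: shares[i] - base[i], reverse=True)
    for i in order[:leftover]:
        base[i] += 1
    return [floor + b for b in base]


def allocate_budgets(
    avg_budget: int,
    layer_profile: List[float],
    head_utility: List[List[float]],
    min_per_head: int = 1,
) -> List[List[int]]:
    """Return a per-layer, per-head KV budget table (hypothetical rule).

    avg_budget    : average number of cached tokens per head (e.g. 128).
    layer_profile : one offline weight per layer (assumed to be precomputed).
    head_utility  : one online score per head in each layer (assumed given).
    """
    num_layers = len(layer_profile)
    num_heads = len(head_utility[0])
    total = avg_budget * num_layers * num_heads
    # Level 1 (offline): divide the global budget across layers.
    layer_budgets = proportional_split(
        total, layer_profile, floor=min_per_head * num_heads
    )
    # Level 2 (online): divide each layer's budget across its heads.
    return [
        proportional_split(budget, utility, floor=min_per_head)
        for budget, utility in zip(layer_budgets, head_utility)
    ]


# Toy example: 2 layers, 2 heads per layer, average budget of 4 tokens per head.
budgets = allocate_budgets(4, [1.0, 3.0], [[1.0, 1.0], [3.0, 1.0]])
print(budgets)  # [[3, 2], [8, 3]]
```

In this toy setup the global budget (16 slots) is preserved while the second layer and its higher-utility head receive more of it. In a real decoder the second level would be re-run as the head scores change during generation, and each head's budget would then be handed to whatever token-eviction policy is in use.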

Inference & Quantization
Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
