The paper introduces Meta-Soft, a KV cache compression framework that dynamically synthesizes soft tokens from input prompts using a learnable orthogonal basis and a selector network with Gumbel-Softmax. This approach addresses the limitations of static soft token methods by adapting to different input prompts and capturing complex task relevance. Furthermore, Meta-Soft integrates an attention-flow mechanism to redistribute semantic information from evicted tokens, mitigating information loss.
LLMs can now compress their KV cache more effectively by dynamically synthesizing soft tokens tailored to the input, preserving crucial context that's otherwise lost with static methods.
The KV cache in large language models (LLMs) grows linearly with context length, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. KV cache eviction has therefore become an important research direction. However, existing methods based on fixed soft tokens (e.g., Judge Q) use a static parameter set as the query that scores the importance of KV pairs, so they cannot adapt to different input prompts or precisely capture complex, shifting task relevance. In addition, evicted KV pairs are discarded permanently, which causes irreversible information loss and breaks in context. To address these problems, we propose Meta-Soft, a dynamic compression framework based on probe-driven context integration. Specifically, we build a meta-library with a learnable orthogonal basis matrix $\mathcal{L}$ and use a selector network with Gumbel-Softmax to produce differentiable sparse combination weights, which lets us dynamically synthesize the $k$ most targeted soft tokens from the input prompt features. We append these soft tokens to the end of the input sequence to probe for key information. We also introduce an attention-flow-based integration mechanism that redistributes the semantic information of evicted tokens into the retained tokens, preserving context that would otherwise be dropped. Experiments on multiple datasets show that our method outperforms existing state-of-the-art eviction methods and provides a new solution for KV cache compression.
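The two mechanisms described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The function names (`gumbel_softmax`, `synthesize_soft_tokens`, `evict_and_merge`), the temperature `tau`, and the `budget` argument are hypothetical. The selector logits and importance scores are taken as given inputs rather than computed by a network. The merge rule, which adds each evicted value to the retained values in proportion to normalized attention weights, is an assumption, since the abstract does not specify how the attention flow is applied.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for x in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((x - math.log(-math.log(u))) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize_soft_tokens(selector_logits, basis, tau=0.5, rng=random):
    """Build one soft token per row of selector_logits.

    Each soft token is a weighted combination of the rows of `basis`
    (standing in for the learnable orthogonal basis matrix), with weights
    drawn via Gumbel-Softmax. A lower tau gives sparser weights.
    """
    dim = len(basis[0])
    tokens = []
    for logits in selector_logits:
        w = gumbel_softmax(logits, tau, rng)
        tokens.append(
            [sum(w[j] * basis[j][d] for j in range(len(basis))) for d in range(dim)]
        )
    return tokens


def evict_and_merge(values, scores, attn, budget):
    """Keep the `budget` highest-scoring positions and merge the rest.

    ASSUMPTION: each evicted value vector is added to the retained ones in
    proportion to attn[evicted][retained], normalized over retained positions.
    Returns (kept_indices, merged_values).
    """
    order = sorted(range(len(values)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:budget])
    merged = {i: list(values[i]) for i in keep}
    for e in order[budget:]:
        total = sum(attn[e][i] for i in keep)
        if total == 0:
            continue  # no attention mass toward retained tokens: nothing to move
        for i in keep:
            share = attn[e][i] / total
            for d in range(len(values[e])):
                merged[i][d] += share * values[e][d]
    return keep, [merged[i] for i in keep]
```

In this sketch `scores` would come from the attention that the appended soft tokens place on each cached position. A differentiable framework would replace the list arithmetic with tensor operations so the selector can be trained end to end.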