Amazon Science, UIUC | Apr 16, 2026 | arXiv:2604.15153

Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models

Zihao Xu, John Harvill, Ziwei Fan, Yizhou Sun, Hao Ding, Hao Wang

AI Summary

This paper introduces K-Token Merging, a latent-space compression technique that merges blocks of K token embeddings into a single embedding using a lightweight encoder, aiming to reduce the computational cost of processing long prompts in LLMs. The compressed sequence is then processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments across structural reasoning, sentiment classification, and code editing tasks demonstrate that K-Token Merging achieves up to 75% input length reduction with minimal performance degradation, positioning it on the Pareto frontier of performance vs. compression.

Key Contribution

Achieves up to 75% input length reduction in LLMs with minimal performance loss by compressing token embeddings directly in the latent space.

Abstract

Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
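The block-merging step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract only says the encoder is "lightweight", so the single linear projection over the concatenated block, the zero padding of a trailing partial block, and the names `merge_k_tokens` and `length_reduction` are all assumptions made for the example.

```python
import math

import numpy as np


def merge_k_tokens(embeddings, k, weight, bias):
    """Merge each contiguous block of k token embeddings into one embedding.

    embeddings: array of shape (seq_len, d_model)
    weight:     array of shape (k * d_model, d_model); a hypothetical
                one-layer encoder, assumed here for illustration only
    bias:       array of shape (d_model,)
    """
    seq_len, d_model = embeddings.shape
    # Assumption: pad the last block with zeros when seq_len is not a multiple of k.
    pad = (-seq_len) % k
    if pad:
        padding = np.zeros((pad, d_model), dtype=embeddings.dtype)
        embeddings = np.vstack([embeddings, padding])
    # Concatenate the k embeddings of each block into one long vector.
    blocks = embeddings.reshape(-1, k * d_model)
    # Project each concatenated block back down to a single d_model embedding.
    return blocks @ weight + bias


def length_reduction(seq_len, k):
    """Fraction of input positions removed when every block of k is merged."""
    return 1.0 - math.ceil(seq_len / k) / seq_len


rng = np.random.default_rng(0)
seq_len, d_model, k = 16, 8, 4
x = rng.standard_normal((seq_len, d_model))
w = rng.standard_normal((k * d_model, d_model)) * 0.02
b = np.zeros(d_model)

merged = merge_k_tokens(x, k, w, b)
print(merged.shape)                  # (4, 8)
print(length_reduction(seq_len, k))  # 0.75
```

If every block of the input is merged, the sequence shrinks by a factor of K, so a 75% reduction would correspond to K = 4; the abstract does not state which values of K were evaluated. In the framework as described, the merged sequence would then be fed to the LoRA-adapted LLM in place of the original token embeddings.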

Architecture Design (Transformers, SSMs, MoE), Inference & Quantization, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 29

Year: 2026

Venue: N/A
