This paper introduces YOCO++, an improvement to the YOCO cross-layer KV cache compression technique for efficient LLM inference. YOCO++ adds a weighted residual connection between the KVs of each bottom-half layer and the bottom layer to increase model capacity without sacrificing efficiency. Experiments demonstrate that YOCO++ achieves state-of-the-art performance among cross-layer KV compression methods at a 50% compression rate, even surpassing the standard Transformer.
YOCO++ shows you can halve the KV cache size in LLMs and still beat a standard Transformer, thanks to a simple weighted residual connection.
Cross-layer key-value (KV) compression has proven effective for efficient inference of large language models (LLMs). Although such methods reduce the memory consumption of the KV cache, they usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among cross-layer KV compression methods at a 50% KV cache compression rate, outperforming even the standard Transformer.
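The abstract does not give the exact formulation, but the weighted residual connection it describes could be sketched roughly as below. The function name, the scalar per-layer weight, and the tensor shapes are all assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def weighted_kv_residual(kv_layer, kv_bottom, weight):
    """Hypothetical sketch of a YOCO++-style weighted residual:
    blend a bottom-half layer's KV states with the bottom layer's KVs.
    `weight` is assumed here to be a learned scalar per layer."""
    return kv_layer + weight * kv_bottom

# Toy KV tensors with assumed shape (seq_len, num_heads, head_dim).
kv_bottom = np.ones((4, 2, 8))   # KVs from the bottom layer
kv_layer3 = np.zeros((4, 2, 8))  # KVs from some bottom-half layer
out = weighted_kv_residual(kv_layer3, kv_bottom, weight=0.5)
```

Because the residual is a cheap elementwise add on KVs that are computed anyway, it plausibly adds capacity without changing the cache footprint, consistent with the efficiency claim above.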