MemoSight is a unified framework for efficient Chain-of-Thought (CoT) reasoning that addresses KV cache scaling issues by integrating context compression and multi-token prediction. It relies on a minimalist design: special tokens with position layouts tailored to each token type, shared across both compression and multi-token prediction. Experiments on four reasoning benchmarks show that MemoSight reduces the KV cache footprint by up to 66% and accelerates inference by 1.56x, surpassing existing CoT compression methods.
Reasoning with LLMs just got a whole lot faster: MemoSight cuts KV cache footprint by 66% and speeds up inference by 1.56x without sacrificing CoT performance.
While chain-of-thought (CoT) reasoning enables LLMs to solve challenging reasoning problems, the KV cache grows linearly with the number of generated tokens, so CoT reasoning faces scaling issues in both speed and memory usage. In this work, we propose MemoSight (Memory-Foresight-based reasoning), a unified framework that integrates context compression and multi-token prediction to mitigate these efficiency issues while maintaining CoT reasoning performance. Our framework adopts the same minimalist design for both context compression and multi-token prediction, using special tokens and a position layout tailored to each token type. Comprehensive experiments on four reasoning benchmarks demonstrate that MemoSight reduces the KV cache footprint by up to 66% and accelerates inference by 1.56x, while outperforming existing CoT compression methods.
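As a back-of-the-envelope illustration of why context compression bounds KV cache growth, the sketch below assumes a simple block-wise scheme in which every full block of generated tokens is collapsed into a single special-token cache entry. This is a hypothetical model for intuition only, not MemoSight's actual mechanism; the function names and the block size are assumptions.

```python
# Illustrative sketch (not MemoSight's implementation): periodic block-wise
# compression bounds KV cache growth during CoT generation.
# Assumption: each full block of `block` tokens collapses into one
# special-token KV entry; the trailing partial block stays uncompressed.

def kv_entries_plain(num_tokens: int) -> int:
    """Vanilla CoT: one KV cache entry per generated token (linear growth)."""
    return num_tokens


def kv_entries_compressed(num_tokens: int, block: int = 3) -> int:
    """Block-wise compressed CoT: one special-token entry per full block,
    plus the uncompressed remainder of the current block."""
    full_blocks, remainder = divmod(num_tokens, block)
    return full_blocks + remainder


if __name__ == "__main__":
    n = 3000  # hypothetical CoT length in tokens
    plain = kv_entries_plain(n)
    packed = kv_entries_compressed(n, block=3)
    print(f"plain={plain}, compressed={packed}, "
          f"reduction={1 - packed / plain:.0%}")
```

Under this toy model, a 3:1 block ratio removes roughly two thirds of the cache entries, which matches the order of magnitude of the reported up-to-66% footprint reduction; the real trade-off additionally depends on how much reasoning signal the special tokens preserve.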