MemoSight is a unified framework for efficient Chain-of-Thought (CoT) reasoning that addresses KV cache scaling issues by integrating context compression and multi-token prediction. It relies on a minimalist design: special tokens with position layouts tailored to each token type, shared across both compression and multi-token prediction. Experiments on four reasoning benchmarks show that MemoSight reduces the KV cache footprint by up to 66% and accelerates inference by 1.56x, surpassing existing CoT compression methods.
Reasoning with LLMs just got a whole lot faster: MemoSight cuts KV cache footprint by 66% and speeds up inference by 1.56x without sacrificing CoT performance.
While chain-of-thought (CoT) reasoning enables LLMs to solve challenging reasoning problems, the KV cache grows linearly with the number of generated tokens, so CoT reasoning faces scaling issues in both speed and memory usage. In this work, we propose MemoSight (Memory-Foresight-based reasoning), a unified framework that integrates context compression and multi-token prediction to mitigate these efficiency issues while maintaining CoT reasoning performance. Our framework adopts the same minimalist design for both context compression and multi-token prediction, using special tokens and a position layout tailored to each token type. Comprehensive experiments on four reasoning benchmarks demonstrate that MemoSight reduces the KV cache footprint by up to 66% and accelerates inference by 1.56x, while outperforming existing CoT compression methods.
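As a back-of-the-envelope illustration of why context compression bounds KV cache growth, the sketch below assumes a simple block-wise scheme in which every full block of generated tokens is collapsed into a single special-token cache entry. This is a hypothetical model for intuition only, not MemoSight's actual mechanism; the function names and the block size are assumptions.

```python
# Illustrative sketch (not MemoSight's implementation): periodic block-wise
# compression bounds KV cache growth during CoT generation.
# Assumption: each full block of `block` tokens collapses into one
# special-token KV entry; the trailing partial block stays uncompressed.

def kv_entries_plain(num_tokens: int) -> int:
    """Vanilla CoT: one KV cache entry per generated token (linear growth)."""
    return num_tokens


def kv_entries_compressed(num_tokens: int, block: int = 3) -> int:
    """Block-wise compressed CoT: one special-token entry per full block,
    plus the uncompressed remainder of the current block."""
    full_blocks, remainder = divmod(num_tokens, block)
    return full_blocks + remainder


if __name__ == "__main__":
    n = 3000  # hypothetical CoT length in tokens
    plain = kv_entries_plain(n)
    packed = kv_entries_compressed(n, block=3)
    print(f"plain={plain}, compressed={packed}, "
          f"reduction={1 - packed / plain:.0%}")
```

Under this toy model, a 3:1 block ratio removes roughly two thirds of the cache entries, which matches the order of magnitude of the reported up-to-66% footprint reduction; the real trade-off additionally depends on how much reasoning signal the special tokens preserve.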