Tsinghua AI, HKUST, MARS Lab, Mistral

Received 10 September 2025; revised 3

Jun 9, 2026

arXiv:2606.11052

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Xinyu Zhou, Boyu Zhu, Yi Xu, Zhiwei Li, Yingfa Chen, Huiming Wang, Zhijiang Guo

AI Summary

This paper investigates how chain-of-thought (CoT) supervised fine-tuning (SFT) degrades long-context recall in hybrid linear-attention models, with retrieval performance dropping most sharply on harder retrieval tasks and at longer context windows. The authors attribute the drop to CoT-SFT biasing attention gradients toward short-range patterns, which disrupts the query-key projections (W_Q, W_K) responsible for long-range routing. To address this, they introduce QK-Restore, a training-free method that restores only W_Q and W_K from the pre-SFT checkpoint and keeps all other post-SFT parameters, improving long-context recall while preserving reasoning performance.

Key Contribution

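CoT fine-tuning can cut long-range recall in hybrid LLMs by more than 57 percentage points (for example, HypeNet-9B on NIAH-S2@256K falls from 67.2% to 9.4%), but restoring the query and key projections from the pre-SFT checkpoint brings long-context capability back without additional training.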

Abstract

Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from 67.2% to 9.4%. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting query-key projections (W_Q, W_K) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only W_Q and W_K from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from 65.4% to 76.4% while maintaining strong reasoning performance.
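The restoration step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes both checkpoints are available as flat name-to-tensor mappings (for example, PyTorch state dicts), and it assumes the query and key projection weights can be identified by the name suffixes `q_proj.weight` and `k_proj.weight`, a common but not universal naming convention. The function name `qk_restore` and the suffix list are hypothetical, and the abstract does not say which layers are restored, so this sketch restores every matching parameter. The Procrustes variant is not covered.

```python
from typing import Any, Dict, Mapping, Sequence

# ASSUMPTION: query/key projection weights end with these suffixes.
# Real checkpoints may use different names; adjust to the actual model.
QK_SUFFIXES: Sequence[str] = ("q_proj.weight", "k_proj.weight")


def qk_restore(
    pre_sft: Mapping[str, Any],
    post_sft: Mapping[str, Any],
    suffixes: Sequence[str] = QK_SUFFIXES,
) -> Dict[str, Any]:
    """Return a merged checkpoint: W_Q and W_K come from the pre-SFT
    checkpoint, every other parameter comes from the post-SFT checkpoint.

    No training is involved; this is a pure parameter swap.
    """
    merged = dict(post_sft)  # shallow copy, so post_sft itself is not modified
    for name in post_sft:
        if any(name.endswith(suffix) for suffix in suffixes):
            if name not in pre_sft:
                raise KeyError(f"{name!r} is missing from the pre-SFT checkpoint")
            merged[name] = pre_sft[name]
    return merged
```

With PyTorch, the merged mapping could be loaded into the fine-tuned model via `model.load_state_dict(merged)`, provided the two checkpoints share the same architecture and parameter names.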

Reasoning & Chain-of-Thought; Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
