Apr 7, 2026 | arXiv:2604.05688

Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

Zhen Cheng, Hao-Bo Yang, Wan-Yi Huang, Jinlong Li

AI Summary

Attention Editing converts pre-trained LLMs to more efficient attention architectures such as MLA and GateSWA without full retraining. The method uses progressive distillation: layer-wise teacher-forced optimization followed by model-level distillation on next-token distributions, optionally regularized by weak feature matching. Experiments on Qwen3-8B and Qwen3-30B-A3B show that the converted models remain competitive while improving efficiency. Training was run on an Ascend 910B cluster.

Key Contribution

Swap out your LLM's inefficient attention mechanism for a faster one without retraining from scratch.

Abstract

Key-value (KV) cache memory and bandwidth increasingly dominate large language model (LLM) inference cost in long-context and long-generation regimes. Architectures such as multi-head latent attention (MLA) and hybrid sliding-window attention (SWA) can alleviate this bottleneck, but integrating them into existing models remains difficult. Prior methods impose fine-grained structural requirements on both the source and target attention modules, and these requirements are often infeasible to satisfy in practical deployment. We present Attention Editing, a practical framework for converting already-trained LLMs to new attention architectures without re-pretraining from scratch. Attention Editing replaces the original attention with a learnable target module and trains it using progressive distillation, which consists of (1) layer-wise teacher-forced optimization with intermediate activation supervision to prevent cold-start error accumulation, and (2) model-level distillation on next-token distributions, optionally regularized by weak feature matching. We instantiate the framework on two different targets, MLA and GateSWA (a gated hybrid SWA design), and apply it to Qwen3-8B and Qwen3-30B-A3B. The resulting models maintain competitive performance while delivering substantial efficiency improvements, demonstrating that large-scale attention conversion is both feasible and robust. Notably, the experiments are conducted on an Ascend 910B cluster, offering a practical training case study on domestic hardware.
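
The two distillation stages described in the abstract can be illustrated with a minimal pure-Python sketch. The abstract does not give the exact loss formulas, so everything here is an assumption made for illustration: mean squared error for the layer-wise activation supervision, KL divergence from teacher to student for the next-token distillation, and a small hypothetical weight `feat_weight` for the optional feature-matching term. The function names are likewise hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for a single next-token distribution."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_loss(teacher_attn_out, student_attn_out):
    """Stage 1 (assumed form): the new attention module receives the
    teacher's input activations for its layer (teacher forcing) and is
    trained to match the teacher's output for that same layer."""
    return mse(teacher_attn_out, student_attn_out)


def model_level_loss(teacher_logits, student_logits,
                     teacher_feats=None, student_feats=None,
                     feat_weight=0.01):
    """Stage 2 (assumed form): distill next-token distributions, with an
    optional, weakly weighted feature-matching regularizer."""
    loss = kl_divergence(teacher_logits, student_logits)
    if teacher_feats is not None and student_feats is not None:
        loss += feat_weight * mse(teacher_feats, student_feats)
    return loss
```

In this sketch, stage 1 would be applied to each converted layer independently, so that an untrained module never feeds its errors into later layers. Stage 2 would then train the whole converted model end to end against the teacher's output distribution.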

Architecture Design (Transformers, SSMs, MoE) | Inference & Quantization | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
