Tsinghua AI; Ministry of Education Key Laboratory of Intelligent Networks and Network Security; Tongji; UC Santa Cruz; XJTU

Feb 12, 2026

arXiv:2602.11761

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM Team: Wenhao An, Yingfa Chen, Yewei Fang, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, Chuan Liu, Hezi Liu, Hongya Lyu, Shixin Ren, Xingyu Shen, Haojun Sun, Yan-Ting Sun, Z. Thai, Xin-Yu Tian, Xiaorong Wang, Yudong Wang, Bo Wu, Xiaoyue Xu, Shuaikang Xue, Jiawei Yang, Bowen Zhang, Jinqian Zhang, Letian Zhang, Shengnan Zhang, Xinyu Zhang, Zhu Zhang, Hengyu Zhao, Jiachen Zhao, Jie Zhou, Xuelin Han, Zhiyuan Liu, Maosong Sun

AI Summary

The paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that combines sparse attention (InfLLM-V2) and linear attention (Lightning Attention) to improve long-context modeling efficiency. A layer selection algorithm integrates the two attention mechanisms in a 1:3 ratio, and a hybrid positional encoding (HyPE) helps maintain performance on long-context tasks. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at a sequence length of 256K tokens and supports context lengths of up to 1M tokens.
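To make the efficiency argument for the linear-attention layers concrete, here is a minimal sketch of generic causal linear attention in plain Python. It is not Lightning Attention's actual kernel and not the paper's code: the `elu(x) + 1` feature map, the normalization, and all function names are assumptions chosen for illustration. The point it shows is that the recurrent form carries only a fixed-size state from token to token, yet yields the same outputs as the form that revisits every earlier token.

```python
import math

def feature_map(x):
    # Assumed positive feature map elu(x) + 1; the paper's choice may differ.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention_recurrent(qs, ks, vs):
    """Causal linear attention using a running state of fixed size."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                          # running sum of phi(k)
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        fq, fk = feature_map(q), feature_map(k)
        for i in range(d_k):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        denom = sum(fq[i] * norm[i] for i in range(d_k))
        outputs.append(
            [sum(fq[i] * state[i][j] for i in range(d_k)) / denom
             for j in range(d_v)]
        )
    return outputs

def linear_attention_quadratic(qs, ks, vs):
    """Reference form that re-reads all earlier keys and values per token."""
    outputs = []
    for t, q in enumerate(qs):
        fq = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(fq, feature_map(ks[s])))
            for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * vs[s][j] for s, w in enumerate(weights)) / total
             for j in range(len(vs[0]))]
        )
    return outputs
```

In the recurrent form, the memory held between steps is `d_k * d_v + d_k` numbers regardless of how many tokens have been processed, which is the property that lets linear-attention layers avoid a cache that grows with sequence length.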

Key Contribution

MiniCPM-SALA, a hybrid sparse-linear attention model, achieves up to 3.5x the inference speed of a full-attention model at 256K tokens and supports context lengths of up to 1M tokens on a single GPU, while maintaining general capabilities comparable to full-attention models.

Abstract

The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.
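The memory constraint mentioned at the end of the abstract can be illustrated with back-of-the-envelope cache arithmetic. The sketch below uses a hypothetical configuration (32 layers, 32 query heads, 8 key/value heads, head dimension 128, 2-byte elements); none of these values come from the paper. It further assumes that sparse-attention layers still keep a full key/value cache, that linear-attention layers keep one fixed-size `head_dim x head_dim` state per head, and that the 1:3 ratio means one quarter of the layers are sparse. Model weights and activations are not counted.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (factor 2) are stored for every token in every layer.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

def linear_state_bytes(n_layers, n_heads, head_dim, bytes_per_elem=2):
    # One head_dim x head_dim state per head, independent of sequence length.
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem

# Hypothetical configuration for illustration only; not the paper's values.
HYPOTHETICAL = {"n_layers": 32, "n_heads": 32, "n_kv_heads": 8, "head_dim": 128}

def compare_cache(seq_len, cfg=HYPOTHETICAL):
    """Return (full_attention_bytes, hybrid_bytes) for a given sequence length."""
    sparse_layers = cfg["n_layers"] // 4            # 1:3 sparse-to-linear ratio
    linear_layers = cfg["n_layers"] - sparse_layers
    full = kv_cache_bytes(
        seq_len, cfg["n_layers"], cfg["n_kv_heads"], cfg["head_dim"]
    )
    hybrid = kv_cache_bytes(
        seq_len, sparse_layers, cfg["n_kv_heads"], cfg["head_dim"]
    ) + linear_state_bytes(linear_layers, cfg["n_heads"], cfg["head_dim"])
    return full, hybrid

if __name__ == "__main__":
    full, hybrid = compare_cache(1_048_576)  # 1M tokens
    gib = 1024 ** 3
    print(f"full attention: {full / gib:.1f} GiB, hybrid: {hybrid / gib:.2f} GiB")
```

Under these assumed values, the full-attention cache at 1,048,576 tokens is 128 GiB, while the hybrid layout needs about 32 GiB, almost all of it in the sparse layers. The real figures depend on the model's actual configuration and on how InfLLM-V2 stores its cache.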

Architecture Design (Transformers, SSMs, MoE); Inference & Quantization; Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 50

Year: 2026

Venue: N/A
