Jun 9, 2026 · arXiv:2606.10722

Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

Ruixuan Huang, Jinyuan Shi, Hantao Huang, Yifan Huang, Ziyi Guan, Hao Zeng, Ian En-Hsu Yen, Minghui Yu

AI Summary

This paper studies continual training that converts a dense Qwen2.5-8B backbone into a channel-sparse model using a predictor-gated sparse SwiGLU feed-forward network (FFN). A low-rank predictor produces per-token, per-layer routing logits, and a bank-wise top-k rule keeps 16 of every 64 channels, giving 4x sparsity in the FFN intermediate activation. The authors report the architecture, training recipe, benchmark performance, and training lessons. They also identify a layer-local long-context failure mode on RULER-CWE and propose a single-layer repair algorithm that substantially improves the affected length range.

Key Contribution

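Dense LLM checkpoints can be upcycled through continual training into hardware-oriented channel-sparse models with 4x sparsity in the FFN intermediate activation, using a routing predictor that is optimized on the main language modeling path rather than attached post hoc at inference time.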

Abstract

We study dense-to-sparse continual training as a way to construct channel-sparse large language models from dense checkpoints. Starting from a Qwen2.5-8B dense backbone, we continue training at 32K context and introduce a predictor-gated sparse SwiGLU FFN in the 32K stage. For each token and layer, we use a low-rank predictor to produce FFN-channel routing logits. We then apply a bank-wise top-k rule to retain 16 channels in every 64-channel bank, yielding 4x sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, the routing module is placed on the main language modeling path and optimized during continual training, enabling the dense model to be upcycled into a hardware-oriented sparse model. We report the architecture, training recipe, benchmark performance, and training lessons. We also identify a layer-local long-context failure mode on RULER-CWE and propose a single-layer repair algorithm that substantially improves the affected length range.
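The routing rule described in the abstract can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation: the function names, the matrix shapes, the toy dimensions, and the choice to apply the selection as a hard 0/1 mask on the SwiGLU hidden activation are all assumptions made for illustration. Only the low-rank predictor, the 64-channel banks, and the keep-16 rule come from the abstract.

```python
import math
import random

def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def silu(z):
    """SiLU activation used by SwiGLU."""
    return z / (1.0 + math.exp(-z))

def bankwise_topk_mask(logits, bank_size=64, keep=16):
    """Return a 0/1 mask that keeps the `keep` largest logits in each bank."""
    if len(logits) % bank_size != 0:
        raise ValueError("channel count must be a multiple of bank_size")
    mask = [0] * len(logits)
    for start in range(0, len(logits), bank_size):
        bank = range(start, start + bank_size)
        # Sort by descending logit; ties go to the lower channel index.
        for i in sorted(bank, key=lambda i: (-logits[i], i))[:keep]:
            mask[i] = 1
    return mask

def sparse_swiglu(x, W_gate, W_up, W_down, P_down, P_up, bank_size=64, keep=16):
    """Predictor-gated SwiGLU FFN for one token (hypothetical formulation)."""
    # Low-rank predictor: d_model -> rank -> d_ff routing logits.
    logits = matvec(P_up, matvec(P_down, x))
    mask = bankwise_topk_mask(logits, bank_size, keep)
    gate = matvec(W_gate, x)
    up = matvec(W_up, x)
    # Assumption: dropped channels are zeroed. The dense products are computed
    # here only for clarity; a sparse kernel would skip the dropped channels.
    hidden = [m * silu(g) * u for m, g, u in zip(mask, gate, up)]
    return matvec(W_down, hidden), mask

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumed): 8-dim model, 128 FFN channels (two banks), rank-4 predictor.
random.seed(0)
d_model, d_ff, rank = 8, 128, 4
x = [random.gauss(0.0, 1.0) for _ in range(d_model)]
y, mask = sparse_swiglu(
    x,
    W_gate=rand_matrix(d_ff, d_model),
    W_up=rand_matrix(d_ff, d_model),
    W_down=rand_matrix(d_model, d_ff),
    P_down=rand_matrix(rank, d_model),
    P_up=rand_matrix(d_ff, rank),
)
print(sum(mask), "of", d_ff, "channels active")
```

Because the selection is made independently inside each fixed-size bank, every bank carries the same number of active channels, which is presumably what makes the pattern hardware-oriented. The sketch covers only the forward pass; how gradients reach the predictor during continual training is not specified in the abstract and is not modeled here.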

Architecture Design (Transformers, SSMs, MoE) · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
