Ant Group, CUHK, Harbin Engineering University, Huawei, NiuTrans Research, Northeastern, UMD

Jun 15, 2026

arXiv:2606.16456

SPRI: SVD-Partitioned Residual Initialization for Data-Constrained MoE Upcycling

Weiqiao Shan, Ruixiang Mao, Yuang Li, Yuhao Zhang, Yingfeng Luo, Tong Zheng, Chen Xu, Yucheng Qiao, Chunxiang Jin, Yi Yuan, Jingdong Chen, Tong Xiao, Jingbo Zhu

AI Summary

This paper introduces SVD-Partitioned Residual Initialization (SPRI) for upcycling pretrained dense models into sparse Mixture-of-Experts (MoE) models under data-constrained supervised adaptation. SPRI distributes SVD-partitioned residuals derived from pretrained feed-forward network (FFN) weights across routed experts, which introduces controlled expert diversity while preserving the pretrained weight structure, and pairs this with a two-stage training strategy for adaptation stability. On multilingual speech-to-text translation (CoVoST2, 15 En-to-XX directions), SPRI improves average BLEU and COMET over fully fine-tuned dense models by 2.58 and 3.32 points, and over the prior best MoE upcycling baseline by 3.39 BLEU and 4.34 COMET points.

Key Contribution

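SPRI outperforms the prior best MoE upcycling baseline by 3.39 BLEU and 4.34 COMET points on average across 15 CoVoST2 En-to-XX directions, and fully fine-tuned dense models by 2.58 BLEU and 3.32 COMET points. Taken together, these averages imply that the prior best upcycling baseline trails the fine-tuned dense model by 0.81 BLEU and 1.02 COMET points in this setting, while SPRI's use of pretrained spectral structure to diversify routed experts reverses that gap.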

Abstract

Mixture-of-Experts (MoE) models enable efficient scaling, but training them from scratch remains prohibitively expensive. MoE upcycling mitigates this cost by converting pretrained dense models into sparse MoE models. However, existing upcycling methods typically rely on large-scale continued training and often perform poorly under data-constrained supervised adaptation, due to either homogeneous experts or overly disruptive perturbations to pretrained parameters. In this setting, effective upcycling must leverage pretrained weight structure while introducing sufficient diversity among routed experts. To this end, we propose SVD-Partitioned Residual Initialization (SPRI), which distributes SVD-partitioned residuals derived from pretrained feed-forward network (FFN) weights across routed experts, introducing controlled expert diversity grounded in pretrained spectral structure. We further introduce a two-stage training strategy to improve adaptation stability. We evaluate SPRI on multilingual speech-to-text translation, where limited supervised data challenges MoE upcycling and multiple target languages provide natural routing heterogeneity. On CoVoST2 across 15 En-to-XX directions, SPRI improves average BLEU and COMET over fully fine-tuned dense models by 2.58 and 3.32 points, respectively, and outperforms the prior best MoE upcycling baseline by 3.39 BLEU and 4.34 COMET points.
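The abstract does not specify how the SVD partition is formed or how the residuals are combined with the pretrained weights, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that the top singular components of a pretrained FFN weight matrix form a base shared by all experts, and that the remaining components are split round-robin into disjoint residuals, one per routed expert. The function name `spri_init_sketch` and the parameters `keep_rank` and `scale` are hypothetical and do not come from the paper.

```python
import numpy as np


def spri_init_sketch(w, num_experts, keep_rank, scale=1.0):
    """Illustrative expert initialization from one pretrained FFN weight.

    ASSUMPTION: this partitioning scheme is a guess at what "SVD-partitioned
    residuals" could mean; the source abstract gives no formula.

    w           : (d_out, d_in) pretrained dense FFN weight matrix
    num_experts : number of routed experts to initialize
    keep_rank   : number of top singular components shared by every expert
    scale       : multiplier on each expert's residual (hypothetical knob)
    Returns (base, experts), where experts is a list of (d_out, d_in) arrays.
    """
    # Thin SVD: w == (u * s) @ vt, singular values sorted in descending order.
    u, s, vt = np.linalg.svd(w, full_matrices=False)

    # Shared base built from the leading singular components.
    base = (u[:, :keep_rank] * s[:keep_rank]) @ vt[:keep_rank]

    # Indices of the remaining components, dealt round-robin to the experts.
    tail = np.arange(keep_rank, s.shape[0])
    experts = []
    for e in range(num_experts):
        idx = tail[e::num_experts]
        # Residual spanned by this expert's own slice of the spectrum.
        residual = (u[:, idx] * s[idx]) @ vt[idx]
        experts.append(base + scale * residual)
    return base, experts


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))
base, experts = spri_init_sketch(w, num_experts=3, keep_rank=2)

# With scale=1.0 the residuals are disjoint pieces of w - base,
# so adding all of them back to the base recovers the pretrained weight.
reconstructed = base + sum(expert - base for expert in experts)
```

Under these assumptions every expert starts from the same dominant pretrained directions but differs in the lower-energy directions it receives, which is one way to obtain diversity without random perturbation of the pretrained parameters. The two-stage training strategy mentioned in the abstract is not covered by this sketch.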

Architecture Design (Transformers, SSMs, MoE)

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
