Apr 23, 2026 | arXiv:2604.21330

Teacher-Guided Routing for Sparse Vision Mixture-of-Experts

M. Kada, Ryota Yoshihashi, Satoshi Ikehata, Rei Kawakami, Ikuro Sato

AI Summary

This paper introduces Teacher-Guided Routing for Sparse Vision Mixture-of-Experts (TGR-MoE) to address optimization challenges in sparse MoE training, specifically gradient blocking and unstable routing dynamics. TGR-MoE uses a teacher router, derived from a pretrained dense model, to provide pseudo-supervision to the student router, guiding expert selection. Experiments on ImageNet-1K and CIFAR-100 show that TGR-MoE improves accuracy and routing consistency while maintaining stable training even with high sparsity.

Key Contribution

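A simple teacher-guided routing scheme that transfers routing knowledge from a pretrained dense model to the sparse MoE router, stabilizing training and mitigating the gradient blocking that starves the router of feedback from unselected experts.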

Abstract

Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of experts for each input, achieving high scalability without sacrificing inference speed. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives informative gradients only through the experts selected in the forward pass, it suffers from gradient blocking and obtains little information from unselected routes. This limited, highly localized feedback makes it difficult for the router to learn appropriate expert-selection scores and often leads to unstable routing dynamics, such as fluctuating expert assignments during training. To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model. TGR-MoE constructs a teacher router from the teacher's intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations during training and enabling knowledge-guided expert selection from the early stages of training. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR consistently improves both accuracy and routing consistency, while maintaining stable training even under highly sparse configurations.
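To make the gradient-blocking argument concrete, the following is a minimal sketch of one plausible form of teacher-guided router supervision. The abstract does not specify the loss, so everything here is an assumption made for illustration: the KL divergence between softmax routing distributions, the function names (`routing_kl`, `routing_kl_grad`, `top_k_indices`), and the example logits are hypothetical and are not taken from the paper. The sketch shows only that a distribution-matching loss against a teacher router yields a gradient for every expert's score, including experts the student did not select in the forward pass.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of expert scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(scores, k):
    """Indices of the k highest-scoring experts (sparse selection)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def routing_kl(teacher_logits, student_logits):
    """KL(teacher || student) over experts for one token.

    Assumption: the paper may use a different pseudo-supervision loss.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def routing_kl_grad(teacher_logits, student_logits):
    """Gradient of the KL loss w.r.t. the student logits: q - p.

    Every expert receives a gradient entry, selected or not.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return [qi - pi for pi, qi in zip(p, q)]

# Hypothetical routing scores for one token over four experts.
student_logits = [0.1, 2.0, -1.0, 0.5]
teacher_logits = [1.5, 0.2, -0.5, 1.0]

selected = top_k_indices(student_logits, k=1)    # experts used in the forward pass
grad = routing_kl_grad(teacher_logits, student_logits)

# Gradient entries for experts that were NOT selected by the student.
unselected_grad = [g for i, g in enumerate(grad) if i not in selected]
```

In this sketch the task loss would reach the router only through the expert in `selected`, while `unselected_grad` holds nonzero entries for the remaining experts. This is the kind of dense feedback that teacher supervision can supply under the stated assumptions. How the teacher's routing outputs are actually derived from the dense model's intermediate representations is described in the paper itself and is not reproduced here.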

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 50

Year: 2026

Venue: N/A
