Macao Polytechnic University · Apr 9, 2026 · arXiv:2604.08133

Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

Baihui Liu, Kaiyuan Tian, Wei Wang, Zhaoning Zhang, Linbo Qiao, Dongsheng Li

AI Summary

This paper introduces Alloc-MoE, a framework for optimizing expert activation allocation in Mixture-of-Experts models under a constrained "activation budget." Alloc-MoE operates at both the layer (Alloc-L) and token (Alloc-T) levels, using sensitivity profiling and dynamic programming to allocate activations across layers and redistributing activations based on routing scores within tokens. Experiments on DeepSeek-V2-Lite demonstrate that Alloc-MoE achieves significant speedups (1.15x prefill, 1.34x decode) at half the original activation budget while maintaining model performance.
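The layer-level idea (sensitivity profiling followed by dynamic programming) can be illustrated with a small Python sketch. It assumes a profiled table `loss[l][k-1]` giving the degradation observed when layer `l` runs with `k` active experts, and it assumes that per-layer degradations add up. The table, the additive objective, and the name `allocate_layers` are hypothetical choices made for illustration; the summary does not state the paper's actual formulation.

```python
from typing import List, Sequence


def allocate_layers(loss: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Pick a per-layer expert count k_l with sum(k_l) <= budget.

    loss[l][k-1] is a HYPOTHETICAL profiled degradation for layer l when k
    experts are active. The additive objective is an assumption of this sketch.
    """
    inf = float("inf")
    # best[b] = minimal total loss using exactly b activations so far.
    best = [0.0] + [inf] * budget
    choices: List[List[int]] = []  # choices[l][b] = k picked for layer l to reach b

    for layer_loss in loss:
        new_best = [inf] * (budget + 1)
        picked = [0] * (budget + 1)
        for used in range(budget + 1):
            if best[used] == inf:
                continue
            for k, cost in enumerate(layer_loss, start=1):
                if used + k > budget:
                    break  # larger k only uses more budget
                total = best[used] + cost
                if total < new_best[used + k]:
                    new_best[used + k] = total
                    picked[used + k] = k
        best = new_best
        choices.append(picked)

    end = min(range(budget + 1), key=lambda b: best[b])
    if best[end] == inf:
        raise ValueError("budget too small: each layer needs at least one expert")

    # Walk back through the recorded choices to recover the allocation.
    allocation: List[int] = []
    remaining = end
    for picked in reversed(choices):
        k = picked[remaining]
        allocation.append(k)
        remaining -= k
    return allocation[::-1]


# Toy sensitivity table: 3 layers, up to 3 experts each, total budget of 6.
toy_loss = [
    [1.0, 0.5, 0.1],   # layer 0 benefits a lot from extra experts
    [0.3, 0.2, 0.15],  # layer 1 is insensitive
    [2.0, 0.4, 0.3],   # layer 2 needs at least two experts
]
allocation = allocate_layers(toy_loss, budget=6)
```

In this toy table the insensitive middle layer gives up activations to the other two, which a uniform split of two experts per layer would not do.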

Key Contribution

Allocating a fixed budget of expert activations across layers and tokens yields 1.34x decode and 1.15x prefill speedups on DeepSeek-V2-Lite at half the original activation budget, while maintaining model performance.

Abstract

Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to its sparse activation mechanism. However, the substantial number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations can lead to severe model performance degradation. In this work, we introduce the concept of an "activation budget", a constraint on the number of expert activations, and propose Alloc-MoE, a unified framework that jointly optimizes budget allocation at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose Alloc-T, which dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments across multiple MoE models demonstrate that Alloc-MoE maintains model performance under a constrained activation budget. In particular, Alloc-MoE achieves 1.15x prefill and 1.34x decode speedups on DeepSeek-V2-Lite at half of the original budget.
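The token-level idea, redistributing a fixed number of activations according to routing scores, can be sketched in a few lines of Python. The rule used here is an assumption made for illustration: every token keeps its top-1 expert, and the remaining activations in the batch go to the highest-scoring (token, expert) pairs. The abstract does not spell out the rule Alloc-T actually uses, and the name `redistribute_tokens` is hypothetical.

```python
from typing import List


def redistribute_tokens(scores: List[List[float]], avg_k: int) -> List[List[int]]:
    """Share len(scores) * avg_k activations among tokens by routing score.

    scores[t][e] is the routing score of expert e for token t.
    ASSUMED rule (not taken from the paper): each token keeps its top-1
    expert, then the leftover activations go to the globally highest scores.
    """
    if avg_k < 1:
        raise ValueError("avg_k must be at least 1")
    num_tokens = len(scores)
    total = num_tokens * avg_k

    # Rank experts per token, best first.
    ranked = [
        sorted(range(len(row)), key=row.__getitem__, reverse=True)
        for row in scores
    ]
    chosen = [[order[0]] for order in ranked]

    # All non-top-1 candidates compete for the leftover activations.
    leftovers = [
        (scores[t][e], t, e)
        for t, order in enumerate(ranked)
        for e in order[1:]
    ]
    leftovers.sort(key=lambda item: item[0], reverse=True)
    for _, t, e in leftovers[: total - num_tokens]:
        chosen[t].append(e)
    return chosen


# Two tokens, three experts, an average of two activations per token.
toy_scores = [
    [0.70, 0.20, 0.10],  # confident token: one expert dominates
    [0.40, 0.35, 0.25],  # uncertain token: scores are spread out
]
selection = redistribute_tokens(toy_scores, avg_k=2)
```

The total number of activations matches uniform top-2 routing (four), but the confident token keeps one expert and the uncertain token receives three.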

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
