CAS · Mar 5, 2026 · arXiv:2603.04971

Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation

Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

AI Summary

The paper introduces Mixture of Universal Experts (MoUE), an MoE architecture that scales model capacity by converting depth into "virtual width": a universal, layer-agnostic expert pool is reused across layers. To address routing path explosion and the mismatch between reuse-induced exposure and conventional load balancing, MoUE employs a Staggered Rotational Topology, a Universal Expert Load Balance, and a Universal Router. Experiments show that MoUE outperforms matched MoE baselines by up to 1.3% and enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, supporting virtual width as a viable scaling dimension.

Key Contribution

Beyond scaling depth and width, MoUE introduces a new "virtual width" dimension for Mixture-of-Experts by reusing a single expert pool across layers.

Abstract

Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), an MoE generalization introducing a novel scaling dimension: Virtual Width. In general, MoUE aims to reuse a universal layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a routing path explosion from recursive expert reuse, and a mismatch between the exposure induced by reuse and the conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.
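
The abstract does not spell out how the expert pool is shared, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes each layer routes over a fixed-size window of a shared pool, with the window shifted by a constant stride per layer and wrapped around the pool. The names staggered_windows, exposure_counts, window, and stride are hypothetical and do not come from the paper. The sketch also counts how many layers can reach each expert, which illustrates why reuse can produce uneven exposure that a per-layer load-balancing objective would not account for.

```python
from typing import List


def staggered_windows(pool_size: int, num_layers: int,
                      window: int, stride: int) -> List[List[int]]:
    """Hypothetical layout: layer l may route to `window` consecutive
    experts of the shared pool, starting at offset l * stride (mod pool_size)."""
    if not 0 < window <= pool_size:
        raise ValueError("window must be in 1..pool_size")
    return [
        [(layer * stride + k) % pool_size for k in range(window)]
        for layer in range(num_layers)
    ]


def exposure_counts(windows: List[List[int]], pool_size: int) -> List[int]:
    """Count how many layers can reach each expert in the shared pool."""
    counts = [0] * pool_size
    for layer_window in windows:
        for expert in layer_window:
            counts[expert] += 1
    return counts


# Four layers rotate fully around a pool of 8: exposure is uniform.
full = staggered_windows(pool_size=8, num_layers=4, window=4, stride=2)
full_exposure = exposure_counts(full, 8)

# Three layers cover the pool unevenly: some experts are reachable
# from two layers, others from only one.
partial = staggered_windows(pool_size=8, num_layers=3, window=4, stride=2)
partial_exposure = exposure_counts(partial, 8)

print(full)
print(full_exposure)
print(partial_exposure)
```

In this toy layout, the three-layer case leaves the experts at the ends of the pool reachable from fewer layers than those in the middle. An exposure-aware balancing term could, for instance, normalize each expert's target load by its exposure count, but that is an assumption about what "depth-aware exposure correction" means here.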

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
