Apr 22, 2026 · arXiv:2604.20156

Temporally Extended Mixture-of-Experts Models

AI Summary

This paper introduces temporally extended Mixture-of-Experts (MoE) layers to reduce the high expert-switching frequency that hinders efficient serving of large MoE models. The authors adapt the option-critic framework from reinforcement learning, adding a controller to each layer that learns when to switch expert sets and which set to load, guided by a deliberation cost. Applying this to gpt-oss-20b with low-rank adapters (LoRA) and a self-distillation reward, they reduce the switch rate from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU.

Key Contribution

Cuts MoE expert-switching rates from over 50% to below 5% while retaining up to 90% of base-model accuracy, by training a controller to decide *when* to switch expert sets, not just *which* experts to use.

Abstract

Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability. We hope this opens a principled path, grounded in the options framework, for memory-efficient serving and continual learning in ever-growing MoE models.
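To make the switch-rate idea concrete, here is a minimal sketch that contrasts per-token top-k routing with routing that holds an expert set across tokens. It is not the paper's method: the paper trains a controller with option-critic and low-rank adapters, whereas this sketch substitutes a hand-written threshold rule in which `eta` plays a role loosely analogous to the deliberation cost. All names (`switch_rate`, `top_k_set`, `per_token_routing`, `temporally_extended_routing`, `margin_terminate`, `eta`) and the toy scores are hypothetical.

```python
def switch_rate(expert_sets):
    """Fraction of token-to-token transitions where the loaded expert set changes."""
    if len(expert_sets) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(expert_sets, expert_sets[1:]))
    return switches / (len(expert_sets) - 1)


def top_k_set(scores, k):
    """Indices of the k highest router scores for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: -scores[e])
    return frozenset(ranked[:k])


def per_token_routing(all_scores, k):
    """Baseline: choose the top-k experts independently at every token."""
    return [top_k_set(s, k) for s in all_scores]


def temporally_extended_routing(all_scores, k, terminate):
    """Hold the current expert set until terminate(scores, current) returns True."""
    current = None
    chosen = []
    for s in all_scores:
        if current is None or terminate(s, current):
            current = top_k_set(s, k)
        chosen.append(current)
    return chosen


def margin_terminate(eta):
    """Assumed stand-in for a learned termination function: switch only when
    the best set beats the currently loaded set by more than eta."""
    def terminate(scores, current):
        best = sum(sorted(scores, reverse=True)[:len(current)])
        kept = sum(scores[e] for e in current)
        return best - kept > eta
    return terminate


if __name__ == "__main__":
    # Toy router scores: 4 tokens, 4 experts, top-2 routing.
    toy_scores = [[3, 1, 0, 0], [0, 3, 1, 0], [0, 0, 3, 1], [1, 0, 0, 3]]
    print("per-token:", switch_rate(per_token_routing(toy_scores, 2)))
    for eta in (0, 2, float("inf")):
        held = temporally_extended_routing(toy_scores, 2, margin_terminate(eta))
        print("eta =", eta, "->", switch_rate(held))
```

Raising `eta` makes the rule more reluctant to swap expert sets, which illustrates the kind of trade-off between switching rate and routing quality that the abstract attributes to the deliberation cost. In the paper that trade-off is learned rather than set by a fixed threshold.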

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 33

Year: 2026

Venue: N/A
