Feb 25, 2026 · arXiv:2602.21798

Excitation: Momentum For Experts

AI Summary

The paper introduces Excitation, a novel optimization framework for Mixture-of-Experts (MoEs) that dynamically modulates parameter updates based on batch-level expert utilization. Excitation addresses "structural confusion" in deep MoEs, where standard optimizers struggle to establish functional signal paths, by amplifying updates to highly-utilized experts and suppressing updates to low-utilization experts. Experiments across language and vision tasks demonstrate that Excitation improves convergence speed and final performance of MoE models without introducing additional per-parameter optimizer state or learnable parameters.

Key Contribution

By selectively amplifying updates to highly-utilized experts, Excitation rescues deep Mixture-of-Experts models from "structural confusion," enabling stable training where standard optimizers fail.

Abstract

We propose Excitation, a novel optimization framework designed to accelerate learning in sparse architectures such as Mixture-of-Experts (MoEs). Unlike traditional optimizers that treat all parameters uniformly, Excitation dynamically modulates updates using batch-level expert utilization. It introduces a competitive update dynamic that amplifies updates to highly-utilized experts and can selectively suppress low-utilization ones, effectively sharpening routing specialization. Notably, we identify a phenomenon of "structural confusion" in deep MoEs, where standard optimizers fail to establish functional signal paths; Excitation acts as a specialization catalyst, "rescuing" these models and enabling stable training where baselines remain trapped. Excitation is optimizer-, domain-, and model-agnostic, requires minimal integration effort, and introduces neither additional per-parameter optimizer state nor learnable parameters, making it highly viable for memory-constrained settings. Across language and vision tasks, Excitation consistently improves convergence speed and final performance in MoE models, indicating that active update modulation is a key mechanism for effective conditional computation.
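To make the idea of utilization-based update modulation concrete, here is a minimal sketch in Python with NumPy. It is not the paper's algorithm: the abstract does not give the modulation rule, so the power-law scaling, the `gamma` parameter, and all function names below are assumptions chosen for illustration only. The sketch shows how per-expert scale factors could be derived from one batch's routing decisions and applied to a plain SGD step, using one scalar per expert and no per-parameter state.

```python
import numpy as np


def expert_utilization(expert_ids, num_experts):
    """Fraction of routed token slots in the batch assigned to each expert."""
    counts = np.bincount(np.ravel(expert_ids), minlength=num_experts)
    return counts / counts.sum()


def modulation_factors(utilization, gamma=1.0):
    """Hypothetical rule (an assumption, not taken from the paper).

    scale = (utilization / uniform share) ** gamma, so experts above the
    uniform share 1/E are amplified and experts below it are suppressed.
    """
    num_experts = utilization.shape[0]
    return (utilization * num_experts) ** gamma


def modulated_sgd_step(expert_params, expert_grads, factors, lr=0.1):
    """Plain SGD where each expert's update is multiplied by its factor."""
    return [p - lr * f * g
            for p, g, f in zip(expert_params, expert_grads, factors)]


# Top-2 routing for 4 tokens over 4 experts (illustrative data).
routing = np.array([[0, 1], [0, 1], [0, 2], [0, 1]])
util = expert_utilization(routing, num_experts=4)
factors = modulation_factors(util)
print(util)     # [0.5   0.375 0.125 0.   ]
print(factors)  # [2.  1.5 0.5 0. ]
```

Under this assumed rule, expert 0 receives twice the baseline update, while the unused expert 3 receives none. The factors are recomputed from each batch's routing, so nothing is stored per parameter and nothing is learned.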

Architecture Design (Transformers, SSMs, MoE) · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
