The paper introduces Excitation, an optimization framework for Mixture-of-Experts (MoE) models that dynamically modulates parameter updates based on batch-level expert utilization. Excitation addresses "structural confusion" in deep MoEs, where standard optimizers fail to establish functional signal paths, by amplifying updates to highly-utilized experts and, optionally, suppressing updates to low-utilization ones. Experiments across language and vision tasks show that Excitation improves both convergence speed and final performance of MoE models without introducing additional per-parameter optimizer state or learnable parameters.
By selectively amplifying updates to highly-utilized experts, Excitation rescues deep Mixture-of-Experts models from "structural confusion," enabling stable training where standard optimizers fail.
We propose Excitation, a novel optimization framework designed to accelerate learning in sparse architectures such as Mixture-of-Experts (MoE) models. Unlike traditional optimizers that treat all parameters uniformly, Excitation dynamically modulates updates using batch-level expert utilization. It introduces a competitive update dynamic that amplifies updates to highly-utilized experts and can selectively suppress low-utilization ones, effectively sharpening routing specialization. Notably, we identify a phenomenon of "structural confusion" in deep MoEs, where standard optimizers fail to establish functional signal paths; Excitation acts as a specialization catalyst, "rescuing" these models and enabling stable training where baselines remain trapped. Excitation is optimizer-, domain-, and model-agnostic, requires minimal integration effort, and introduces neither additional per-parameter optimizer state nor learnable parameters, making it highly viable for memory-constrained settings. Across language and vision tasks, Excitation consistently improves convergence speed and final performance in MoE models, indicating that active update modulation is a key mechanism for effective conditional computation.
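The abstract does not give Excitation's actual scaling rule, so the following is only a minimal sketch of what utilization-based update modulation might look like. It assumes a hypothetical rule in which each expert's update is multiplied by its batch utilization relative to a uniform share, raised to a power `gamma` and clipped. The function names, `gamma`, `min_scale`, and `max_scale` are all illustrative assumptions, not taken from the paper.

```python
from typing import List


def utilization_scales(
    token_counts: List[int],
    gamma: float = 1.0,      # hypothetical sharpness exponent (assumption)
    min_scale: float = 0.0,  # hypothetical lower clip; 0.0 allows full suppression
    max_scale: float = 4.0,  # hypothetical upper clip on amplification
) -> List[float]:
    """Return one update multiplier per expert from batch-level token counts.

    Assumed rule (not the paper's): scale = (utilization / uniform_share) ** gamma,
    clipped to [min_scale, max_scale]. Experts used more than the uniform share
    get a multiplier above 1; experts used less get a multiplier below 1.
    """
    num_experts = len(token_counts)
    total = sum(token_counts)
    if total == 0:
        # No routing information in this batch: leave all updates unchanged.
        return [1.0] * num_experts
    uniform_share = 1.0 / num_experts
    scales = []
    for count in token_counts:
        ratio = (count / total) / uniform_share
        scales.append(min(max(ratio ** gamma, min_scale), max_scale))
    return scales


def modulate_updates(
    updates: List[List[float]], scales: List[float]
) -> List[List[float]]:
    """Multiply each expert's update vector by that expert's scale.

    `updates[e]` stands in for whatever step the base optimizer (SGD, Adam, ...)
    would have applied to expert e's parameters.
    """
    return [[s * u for u in expert_update]
            for expert_update, s in zip(updates, scales)]


# Example: 4 experts, 8 routed tokens in the batch.
counts = [6, 2, 0, 0]
scales = utilization_scales(counts)
base_updates = [[1.0, -2.0], [0.5, 0.5], [1.0, 1.0], [1.0, 1.0]]
modulated = modulate_updates(base_updates, scales)
print(scales)     # [3.0, 1.0, 0.0, 0.0]
print(modulated)  # [[3.0, -6.0], [0.5, 0.5], [0.0, 0.0], [0.0, 0.0]]
```

In this sketch the multipliers are recomputed from each batch's routing counts and then discarded, which is one way a method of this kind could avoid extra per-parameter optimizer state or learnable parameters. Because the scaling is applied to the base optimizer's output, the same wrapper would work with any optimizer.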