This paper introduces Adaptive Inverted-Index Routing for MoE (AIR-MoE), a two-stage routing algorithm for granular Mixture-of-Experts models: vector quantization produces a coarse shortlist of candidate experts, and exact scoring is then applied only to that shortlist. AIR-MoE approximates top-k routing without scoring every expert and without structural constraints on expert parameters, addressing the routing cost bottleneck in granular MoEs. In the reported experiments, AIR-MoE outperforms existing routing methods in granular MoE settings while requiring no changes to the model architecture or loss function.
AIR-MoE's two-stage routing makes granular Mixture-of-Experts more efficient: it avoids scoring every expert and, in the reported experiments, improves on existing routing approaches.
Mixture-of-experts (MoE) models enable scalable transformer architectures by activating only a subset of experts per token. Recent evidence suggests that performance improves with increasingly granular experts, i.e., many small experts instead of a few large ones. However, this regime substantially increases routing cost, which can dominate computation. We introduce adaptive inverted-index routing for MoE (AIR-MoE), an inverted-index-inspired routing architecture based on vector quantization (VQ). In a first stage, AIR-MoE performs coarse shortlisting by assigning tokens to VQ codewords to construct a candidate set of experts. In a second stage, fine scoring computes exact routing scores restricted to this shortlist. This two-stage procedure approximates true top-k routing while avoiding full expert scoring and, in contrast to prior work, imposing no structural constraints on expert parameters. AIR-MoE serves as a drop-in replacement for standard routers and requires no modifications to the model architecture or loss function. We further provide a lower bound on the mass recall achieved by AIR-MoE that yields insights into its inner workings. Empirically, we demonstrate that AIR-MoE achieves improved performance compared to existing routing approaches in granular MoE settings.
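The two-stage procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the codebook is learned, how experts are attached to codewords, or what makes the index adaptive. The sketch therefore assumes dot-product scoring, a codebook sampled from the expert router vectors as a stand-in for a learned one, and an inverted list per codeword holding the experts that score highest against it. All names (`build_inverted_index`, `two_stage_route`, `n_probe`, `experts_per_codeword`) are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_indices(scores, n):
    """Return the indices of the n largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


def build_inverted_index(expert_keys, codewords, experts_per_codeword):
    """Assumed construction: each codeword lists the experts that score highest against it."""
    index = []
    for c in codewords:
        scores = [dot(c, w) for w in expert_keys]
        index.append(top_indices(scores, experts_per_codeword))
    return index


def exact_top_k(token, expert_keys, k):
    """Reference router: score every expert, keep the k best."""
    return top_indices([dot(token, w) for w in expert_keys], k)


def two_stage_route(token, expert_keys, codewords, index, n_probe, k):
    # Stage 1 (coarse shortlisting): pick the best-matching codewords and
    # take the union of their inverted lists as the candidate set.
    cw_scores = [dot(token, c) for c in codewords]
    probed = top_indices(cw_scores, n_probe)
    shortlist = sorted({e for c in probed for e in index[c]})

    # Stage 2 (fine scoring): exact router scores, restricted to the shortlist.
    scores = [dot(token, expert_keys[e]) for e in shortlist]
    best = top_indices(scores, k)
    chosen = [shortlist[i] for i in best]

    # Softmax over the selected experts' scores gives the gate weights.
    logits = [scores[i] for i in best]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return chosen, [x / z for x in exps]


random.seed(0)
d, num_experts, num_codewords = 8, 64, 8
expert_keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_experts)]
codewords = random.sample(expert_keys, num_codewords)  # stand-in for a learned codebook
index = build_inverted_index(expert_keys, codewords, experts_per_codeword=16)
token = [random.gauss(0, 1) for _ in range(d)]
chosen, weights = two_stage_route(token, expert_keys, codewords, index, n_probe=2, k=4)
```

In this sketch the second stage computes at most `n_probe * experts_per_codeword` dot products instead of `num_experts`, at the price of `num_codewords` dot products in the first stage. When every codeword is probed and every list contains every expert, the result coincides with `exact_top_k`; smaller settings trade recall of the true top-k for lower routing cost.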