The paper introduces Expert Threshold (ET) routing, an alternative to Token-Choice Mixture-of-Experts (TC-MoE) routing in autoregressive language models that allocates computation dynamically based on learned thresholds. Whereas TC-MoE routes every token to a fixed number of experts, ET routing sends a token to an expert whenever the token's score exceeds that expert's exponential-moving-average (EMA) threshold, which balances load without auxiliary losses. In pretraining experiments with a 2.4B-parameter model on FineWeb-Edu, ET routing reduces cross-entropy loss by 0.067 relative to TC-MoE, equivalent to a 1.6x improvement in data efficiency.
Ditch the auxiliary losses: Expert Threshold routing achieves better load balancing and language modeling performance than Token-Choice MoE by dynamically routing tokens based on learned thresholds.
Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6$\times$ fewer tokens.
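The routing rule described in the abstract can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the routing test (score exceeds the expert's threshold) follows the text, while the EMA update rule — nudging each threshold toward the per-batch score quantile that would hit a target expert load — is an assumption, since the abstract does not specify how the thresholds are estimated from the global token distribution.

```python
import numpy as np

def et_route(scores, thresholds):
    """Route each token independently: token t goes to expert e
    iff scores[t, e] > thresholds[e]. Fully causal -- no dependence
    on other tokens in the batch at routing time."""
    return scores > thresholds  # boolean mask, shape (tokens, experts)

def update_thresholds(thresholds, scores, target_load=0.5, momentum=0.9):
    """Hypothetical EMA threshold update (assumed, not from the paper):
    move each expert's threshold toward the batch score quantile that
    would route a target_load fraction of tokens to that expert."""
    batch_quantile = np.quantile(scores, 1.0 - target_load, axis=0)
    return momentum * thresholds + (1.0 - momentum) * batch_quantile

# Toy example: 2 tokens, 2 experts.
scores = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
thresholds = np.array([0.5, 0.5])
mask = et_route(scores, thresholds)          # [[True, False], [False, True]]
new_thresholds = update_thresholds(thresholds, scores)
```

Because each token's routing decision depends only on its own score and the slowly moving thresholds, the number of experts activated per token can vary, which is the source of the dynamic computation allocation the abstract describes.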