Search papers, labs, and topics across Lattice.
The paper introduces FineRMoE, a novel Mixture-of-Experts (MoE) architecture that extends fine-grained expert design to both intermediate and output dimensions to overcome performance limitations of single-dimension fine-grained MoEs. A bi-level sparse forward computation paradigm and specialized routing mechanism are introduced to manage expert activation. To reduce training costs, a generalized upcycling method is proposed to build FineRMoE efficiently.
FineRMoE achieves 6x higher parameter efficiency, 281x lower prefill latency, and 136x higher decoding throughput compared to strong baselines, demonstrating a significant leap in MoE performance.
As revealed by the scaling law of fine-grained MoE, model performance ceases to be improved once the granularity of the intermediate dimension exceeds the optimal threshold, limiting further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern the activation. In addition, to obviate the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method to build FineRMoE in a cost-effective manner. Extensive experiments demonstrate the superior performance achieved by FineRMoE across ten standard benchmarks. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 timese higher decoding throughput during inference.