DyMoE dynamically quantizes MoE experts at runtime based on their importance and layer depth, addressing the memory and I/O bottlenecks that hinder MoE inference on edge devices. It combines importance-aware prioritization, depth-adaptive scheduling, and look-ahead prefetching. Experiments on commercial edge hardware show a 3.44x-22.7x reduction in TTFT and up to a 14.58x speedup in TPOT compared to state-of-the-art offloading baselines, enabling real-time MoE inference.
Edge devices can now run MoE models in real time thanks to a dynamic quantization scheme that prioritizes important experts and critical layers.
Despite the computational efficiency of MoE models, the excessive memory footprint and I/O overhead inherent in multi-expert architectures pose formidable challenges for real-time inference on resource-constrained edge platforms. While existing static methods struggle with a rigid latency-accuracy trade-off, we observe that expert importance is highly skewed and depth-dependent. Motivated by these observations, we propose DyMoE, a dynamic mixed-precision quantization framework designed for high-performance edge inference. DyMoE introduces: (1) importance-aware prioritization to dynamically quantize experts at runtime; (2) depth-adaptive scheduling to preserve semantic integrity in critical layers; and (3) look-ahead prefetching to overlap I/O stalls with computation. Experimental results on commercial edge hardware show that DyMoE reduces Time-to-First-Token (TTFT) by 3.44x-22.7x and achieves up to a 14.58x speedup in Time-Per-Output-Token (TPOT) compared to state-of-the-art offloading baselines, enabling real-time, accuracy-preserving MoE inference on resource-constrained edge devices.
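The abstract does not spell out how precision is assigned, so the following is only a minimal sketch of what importance-aware, depth-adaptive precision selection could look like. Everything in it is an assumption made for illustration: the function name `assign_expert_bits`, the 8-bit and 4-bit levels, the keep fractions, and the idea that "critical" layers are supplied as an explicit set. It is not DyMoE's actual algorithm.

```python
from typing import List, Set


def assign_expert_bits(
    importance: List[List[float]],
    critical_layers: Set[int],
    keep_frac: float = 0.25,           # assumed: share of experts kept at high precision
    critical_keep_frac: float = 0.75,  # assumed: larger share in critical layers
    high_bits: int = 8,                # assumed bit-widths, not taken from the paper
    low_bits: int = 4,
) -> List[List[int]]:
    """Return a bit-width for every expert, indexed as plan[layer][expert].

    importance[layer][expert] is a hypothetical importance score, for
    example an average router gate weight observed at runtime.
    """
    plan: List[List[int]] = []
    for layer, scores in enumerate(importance):
        # Depth-adaptive part: critical layers keep more experts at high precision.
        frac = critical_keep_frac if layer in critical_layers else keep_frac
        n_high = max(1, round(frac * len(scores)))

        # Importance-aware part: rank experts by score, highest first.
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        top = set(ranked[:n_high])

        plan.append([high_bits if e in top else low_bits for e in range(len(scores))])
    return plan


if __name__ == "__main__":
    scores = [
        [0.60, 0.20, 0.15, 0.05],  # layer 0, treated as critical in this example
        [0.40, 0.30, 0.20, 0.10],  # layer 1
    ]
    print(assign_expert_bits(scores, critical_layers={0}))
```

With skewed scores like these, only a few experts per layer need the higher bit-width, which is the property the abstract relies on to cut memory and I/O. A real system would recompute the plan as routing statistics change and would pair it with prefetching of the experts expected in upcoming layers.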