The paper introduces TriMoE, a heterogeneous GPU-CPU-NDP architecture for efficient MoE inference that addresses the compute gap caused by warm experts in existing offloading approaches. TriMoE strategically maps hot, warm, and cold experts to GPU, AMX-enabled CPU, and DIMM-NDP respectively, based on their compute and memory access characteristics. By incorporating a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme, TriMoE achieves significant performance improvements over existing single-GPU MoE inference solutions.
A novel GPU-CPU-NDP architecture, TriMoE, unlocks up to 2.83x faster MoE inference by routing "hot," "warm," and "cold" experts to the compute unit where each performs best.
To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU-CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures utilize DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: a significant subset of warm experts is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU-CPU-NDP architecture that fills this gap by leveraging an AMX-enabled CPU, so that hot, warm, and cold experts are each mapped onto their optimal compute unit. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83x speedup over state-of-the-art solutions.
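The three-way mapping described above can be illustrated with a minimal sketch. It is not TriMoE's actual policy: it assumes an expert's "temperature" can be approximated by how many tokens were routed to it in a batch, and the function name `assign_experts`, the device labels, and both thresholds are hypothetical values chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def assign_experts(
    routed_expert_ids: Iterable[int],
    num_experts: int,
    hot_threshold: int,
    warm_threshold: int,
) -> Dict[str, List[int]]:
    """Bucket experts into hot/warm/cold by routed-token count.

    Assumption (not from the paper): token count per expert stands in
    for the compute and memory-access characteristics used for mapping.
    """
    if warm_threshold > hot_threshold:
        raise ValueError("warm_threshold must not exceed hot_threshold")

    counts = Counter(routed_expert_ids)
    placement: Dict[str, List[int]] = {"gpu": [], "cpu_amx": [], "dimm_ndp": []}

    for expert_id in range(num_experts):
        tokens = counts.get(expert_id, 0)
        if tokens >= hot_threshold:
            placement["gpu"].append(expert_id)       # hot: compute-heavy
        elif tokens >= warm_threshold:
            placement["cpu_amx"].append(expert_id)   # warm: mid-range load
        else:
            placement["dimm_ndp"].append(expert_id)  # cold: memory-bound
    return placement


# Hypothetical routing trace: expert 0 gets 8 tokens, expert 1 gets 3,
# expert 2 gets 1, and expert 3 gets none.
routed = [0] * 8 + [1] * 3 + [2]
placement = assign_experts(routed, num_experts=4, hot_threshold=6, warm_threshold=2)
print(placement)
```

A real system would replace the fixed thresholds with the bottleneck-aware scheduling and prediction-driven rebalancing the abstract mentions; this sketch only shows the shape of the hot/warm/cold split.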