This paper introduces THInfer, a hardware-aware inference framework for running LLM inference on bandwidth-constrained, heterogeneous many-core processors such as the MT-3000. THInfer combines hand-optimized FP16 kernels, computation graph fusion, and a Prefill-Buffer-Decode pipeline with hybrid parallelism to maximize data locality and make inter-cluster communication efficient. Experiments with Llama models show that THInfer delivers higher throughput than GPU-based baselines such as DeepSpeed on the 7B model, and that it still runs the 70B model, which typical GPU-based frameworks fail to run under the same setting.
LLM inference on supercomputers doesn't have to be a bottleneck: on the Llama 7B model, THInfer achieves up to 84% higher throughput than an A800 GPU by co-designing hardware-aware kernels and a communication-optimized pipeline.
Large language model (LLM) inference is limited by high computational cost and memory bandwidth demands, making deployment on heterogeneous many-core processors challenging. Taking the MT-3000 processor used in the Tianhe supercomputer as an example, its limited main-memory bandwidth and distributed memory hierarchy exemplify these bottlenecks, making it difficult to directly migrate existing GPU-based inference frameworks. To address this problem, we propose THInfer, a hardware-aware inference framework that maximizes data locality under bandwidth-constrained conditions through hardware-software co-design and parallel strategy optimization. THInfer incorporates three key techniques: (1) a high-performance operator library for the VLIW SIMD architecture, providing hand-optimized FP16 kernels that achieve up to 70 percent of the peak performance per cluster; (2) a density-driven computation graph fusion and unified kernel scheduling mechanism, combined with a staged pipelined attention fusion method; and (3) a Prefill-Buffer-Decode (P-B-D) pipeline and bounded buffer management strategy, which supports hybrid parallelism and enables efficient multi-cluster collaboration through two-level communication based on MPI and hthreads. Experiments on the Llama model series show that THInfer improves throughput on the 7B model by 62 percent to 73 percent over DeepSpeed on two V100S GPUs and by 67 percent to 84 percent over the A800 GPU. The 13B and 30B models also demonstrate comparable or better performance. Moreover, THInfer maintains stable performance on the 70B model, whereas typical GPU-based frameworks fail to run under the same setting. Overall, THInfer significantly enhances throughput, reduces latency, and improves scalability, providing a feasible system solution for efficient and scalable LLM inference on heterogeneous many-core architectures.
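The abstract does not spell out how the Prefill-Buffer-Decode pipeline or its bounded buffer is implemented. As a rough illustration of the general idea only, the sketch below models the three stages as a producer, a fixed-capacity queue, and a consumer using Python's standard `queue` and `threading` modules. Everything in it is an assumption made for illustration: the function names `prefill`, `decode`, and `run_pipeline`, the `buffer_capacity` parameter, and the toy "state" that stands in for a KV cache are hypothetical and are not taken from THInfer, which uses MPI and hthreads across clusters rather than Python threads.

```python
import queue
import threading

SENTINEL = None  # marks the end of the request stream


def prefill(prompt):
    """Stand-in for the prefill stage: process the whole prompt once."""
    # Hypothetical state; a real system would hold a KV cache here.
    return {"prompt": prompt, "kv_len": len(prompt.split())}


def decode(state, max_new_tokens):
    """Stand-in for the decode stage: emit tokens one at a time."""
    return [f"tok{state['kv_len'] + i}" for i in range(max_new_tokens)]


def run_pipeline(prompts, buffer_capacity=2, max_new_tokens=3):
    # Bounded buffer: put() blocks when the queue is full, so the prefill
    # stage cannot run arbitrarily far ahead of the decode stage.
    buffer = queue.Queue(maxsize=buffer_capacity)
    results = {}

    def prefill_worker():
        for request_id, prompt in enumerate(prompts):
            buffer.put((request_id, prefill(prompt)))
        buffer.put(SENTINEL)  # tell the decode stage there is no more work

    def decode_worker():
        while True:
            item = buffer.get()
            if item is SENTINEL:
                break
            request_id, state = item
            results[request_id] = decode(state, max_new_tokens)

    threads = [
        threading.Thread(target=prefill_worker),
        threading.Thread(target=decode_worker),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_pipeline(["a b c", "d e"], buffer_capacity=1)` runs both stages concurrently while never holding more than one prefilled request in the buffer. Capping the buffer is the design point worth noting: it bounds the memory held by requests that have been prefilled but not yet decoded, at the cost of stalling prefill when decode falls behind.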