This paper introduces an end-to-end speculative decoding scheme that accelerates OpenPangu-7B inference on NPUs, specifically targeting memory-bandwidth limitations. The scheme adapts the speculative decoding algorithm to the NPU architecture, and the reported experiments show significantly lower inference latency than the baseline methods.
OpenPangu-7B inference on NPUs gets a serious speed boost via a custom-tailored speculative decoding scheme.
To mitigate the memory-wall bottleneck that Large Language Models (LLMs) encounter during inference on NPU hardware, and to address the scarcity of native support for mainstream speculative decoding algorithms on domestic infrastructure, this study presents an end-to-end speculative inference acceleration scheme for OpenPangu-7B.
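For readers unfamiliar with the general technique, the following is a minimal sketch of greedy draft-and-verify speculative decoding, not the scheme described in this paper. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and the two "models" are toy functions standing in for a large target model and a small draft model. On real hardware the verification loop would be a single batched forward pass of the target model, which is where the memory-bandwidth savings come from.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to the next token.
# This is an illustrative assumption; real models return logits.
NextToken = Callable[[List[int]], int]


def greedy_decode(target_next: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next: NextToken, draft_next: NextToken,
                       prompt: List[int], max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies them."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The target model checks each proposed position. Written as a loop
        #    here; in practice all k positions are scored in one forward pass.
        accepted: List[int] = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                # First mismatch: keep the target's token and discard the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token was accepted, so the target adds one more token.
            accepted.append(target_next(tokens + draft))

        tokens.extend(accepted)
    # The final round may overshoot the budget; trim to the requested length.
    return tokens[:limit]


def toy_target(ctx: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except after tokens 4 and 9."""
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, prompt=[0], max_new_tokens=12))
```

Because every accepted token is either confirmed or supplied by the target model, the greedy variant sketched here produces the same sequence as plain greedy decoding with the target model alone; the speedup depends on how often the draft model's proposals are accepted.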