OPPO, USTC | Mar 3, 2026 | arXiv:2603.03383

Accelerating OpenPangu Inference on NPU via Speculative Decoding

AI Summary

This paper introduces an end-to-end speculative decoding scheme to accelerate the inference of OpenPangu-7B on NPUs, specifically targeting the memory bandwidth limitations. The approach optimizes the speculative decoding algorithm for the NPU architecture to improve inference speed. Experimental results demonstrate a significant reduction in inference latency compared to baseline methods.

Key Contribution

OpenPangu-7B inference on NPUs gets a serious speed boost via a custom-tailored speculative decoding scheme.

Abstract

To mitigate the Memory Wall bottleneck encountered by Large Language Models (LLMs) during inference on NPU hardware, and to address the scarcity of native support for mainstream speculative decoding algorithms on domestic infrastructure, this study presents an end-to-end speculative inference acceleration scheme for OpenPangu-7B.
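For readers unfamiliar with the technique, the general draft-then-verify loop behind speculative decoding can be sketched as below. This is a minimal illustration of the greedy variant, not the paper's NPU implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for OpenPangu-7B and a smaller draft model, and the verification loop stands in for what would be a single batched forward pass of the target model on real hardware.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def target_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the large target model (hypothetical rule)."""
    return (ctx[-1] * 3 + 1) % 11


def draft_next(ctx: List[Token]) -> Token:
    """Toy stand-in for a small draft model: it agrees with the target
    except when the last token is a multiple of 4."""
    if ctx[-1] % 4 == 0:
        return 0
    return target_next(ctx)


def speculative_decode(
    prompt: List[Token], target: NextToken, draft: NextToken, k: int, max_new: int
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, verification rounds)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target model checks the proposal. On real hardware this is
        #    one batched pass over all k positions; here it is a plain loop.
        target_passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong token
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new], target_passes


def greedy_decode(prompt: List[Token], target: NextToken, max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


tokens, passes = speculative_decode([1], target_next, draft_next, k=4, max_new=12)
baseline = greedy_decode([1], target_next, max_new=12)
```

In this sketch `tokens` equals `baseline`, because a draft token is kept only when the target would have produced it anyway. The potential saving is that `passes` can be smaller than the number of generated tokens, so the target model's weights are read from memory fewer times, which is the cost the Memory Wall refers to.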

Distributed Systems & Hardware | Inference & Quantization | Open-Source Models & Weights

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
