Ruhr University · TU Dresden · Apr 23, 2025 · arXiv:2602.23334

Bitwise Systolic Array Architecture for Runtime-Reconfigurable Multi-Precision Quantized Multiplication on Hardware Accelerators

Yuhao Liu, Salim Ullah, Akash Kumar

AI Summary

This paper introduces a runtime-reconfigurable bitwise systolic array architecture for multi-precision quantized neural network (QNN) accelerators. It addresses the inability of fixed-precision hardware to support mixed-precision quantization. The architecture lets the multiplication precision change at runtime, so each layer of a QNN can use the precision that fits its resource and accuracy budget. Implemented on an Ultra96 FPGA, the design achieves a 1.3185× to 3.5671× inference speedup on mixed-precision models and, because of its shorter critical path, runs at a higher clock frequency (250 MHz).
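The summary does not spell out the arithmetic inside each processing element. One common way a "bitwise" multiplier supports several precisions is to split both operands into bits, form every one-bit partial product with an AND, and accumulate those products at the correct place values, where the most significant bit of a two's-complement operand carries a negative weight. The Python sketch below illustrates that idea under those assumptions. It is not the authors' circuit, and the names `to_bits`, `bit_weight`, and `bitwise_multiply` are hypothetical.

```python
def to_bits(x: int, width: int) -> list[int]:
    """Two's-complement bits of x (LSB first) at the given signed bit width."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not lo <= x <= hi:
        raise ValueError(f"{x} does not fit in {width} signed bits")
    # Python's >> is arithmetic on negative ints, so this yields two's-complement bits.
    return [(x >> i) & 1 for i in range(width)]


def bit_weight(i: int, width: int) -> int:
    """Place value of bit i; the MSB of a two's-complement number is negative."""
    w = 1 << i
    return -w if i == width - 1 else w


def bitwise_multiply(a: int, b: int, a_bits: int, b_bits: int) -> int:
    """Signed product built only from 1-bit ANDs and weighted accumulation.

    a_bits and b_bits are ordinary arguments, so the same routine handles any
    precision pair chosen at call time (a software analogue of runtime
    reconfiguration). The loop performs a_bits * b_bits one-bit ANDs.
    """
    a_vec = to_bits(a, a_bits)
    b_vec = to_bits(b, b_bits)
    acc = 0
    for i, ai in enumerate(a_vec):
        for j, bj in enumerate(b_vec):
            if ai & bj:  # the 1-bit partial product a cell would compute
                acc += bit_weight(i, a_bits) * bit_weight(j, b_bits)
    return acc


if __name__ == "__main__":
    print(bitwise_multiply(-3, 5, 4, 4))    # 4-bit x 4-bit: -15
    print(bitwise_multiply(100, -2, 8, 2))  # 8-bit x 2-bit: -200
```

In this sketch the inner loop runs a_bits × b_bits times, so halving both precisions cuts the number of one-bit partial products by 4×. That is the intuition for why lower-precision layers can use fewer resources or cycles in a bitwise design.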

Key Contribution

Achieves up to 3.57× speedup in mixed-precision QNN inference with a runtime-reconfigurable bitwise systolic array whose precision can be adjusted at runtime, letting each layer trade hardware resources against accuracy.
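To make the resource/accuracy trade-off concrete, the sketch below runs one dot product under several per-layer precision settings. It assumes symmetric per-tensor uniform quantization (the quantization scheme is not stated here) and an idealized cost model in which a bitwise multiplier needs weight_bits × act_bits one-bit partial products per multiply. The `LayerPrecision` record, `quantize`, and `quantized_dot` are hypothetical names, and the vectors are toy inputs, not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class LayerPrecision:
    """Hypothetical per-layer precision setting loaded before a layer runs."""
    name: str
    weight_bits: int
    act_bits: int


def quantize(values: list[float], bits: int) -> tuple[list[int], float]:
    """Symmetric per-tensor uniform quantization to signed `bits`-bit integers."""
    if bits < 2:
        raise ValueError("symmetric signed quantization needs at least 2 bits")
    qmax = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in values) or 1.0  # avoid dividing by zero on all-zero input
    q = [max(-qmax, min(qmax, round(v * qmax / peak))) for v in values]
    return q, peak / qmax  # integers plus the scale that maps them back to reals


def quantized_dot(w: list[float], x: list[float], cfg: LayerPrecision) -> tuple[float, int]:
    """Dot product at the configured precision.

    Returns the dequantized result and an idealized cost: the number of
    one-bit partial products a bitwise multiplier would need (len * wb * ab).
    """
    if len(w) != len(x):
        raise ValueError("vectors must have the same length")
    qw, sw = quantize(w, cfg.weight_bits)
    qx, sx = quantize(x, cfg.act_bits)
    acc = sum(a * b for a, b in zip(qw, qx))  # integer MACs
    bit_products = len(w) * cfg.weight_bits * cfg.act_bits
    return acc * sw * sx, bit_products


if __name__ == "__main__":
    w = [0.5, -0.25, 0.75, -1.0]
    x = [1.0, 0.5, -0.5, 0.25]
    exact = sum(a * b for a, b in zip(w, x))
    for cfg in (LayerPrecision("w8a8", 8, 8),
                LayerPrecision("w4a4", 4, 4),
                LayerPrecision("w2a2", 2, 2)):
        approx, cost = quantized_dot(w, x, cfg)
        print(f"{cfg.name}: error={abs(approx - exact):.4f} bit-products={cost}")
```

On these toy vectors the error grows as precision drops while the bit-product count shrinks quadratically, which is the trade-off a per-layer precision choice navigates. Real accuracy effects depend on the network and quantizer.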

Abstract

Neural network accelerators have been widely deployed on edge devices for complex tasks such as object tracking and image recognition. Previous works have explored quantization techniques in lightweight accelerator designs to reduce hardware resource consumption. However, low precision can lead to high accuracy loss in inference. Mixed-precision quantization has therefore emerged as an alternative, applying different precisions to different layers to trade off resource consumption against accuracy. Because conventional hardware multiplier designs cannot reconfigure their precision at runtime for a multi-precision Quantized Neural Network (QNN) model, we propose a runtime-reconfigurable, multi-precision, multi-channel bitwise systolic array design for QNN accelerators. We have implemented and evaluated our work on the Ultra96 FPGA platform. Results show that our work achieves a 1.3185× to 3.5671× speedup in inferring mixed-precision models and has a shorter critical path delay, supporting a higher clock frequency (250 MHz).
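The abstract names a systolic array but does not describe its dataflow. As an illustration only, the Python sketch below simulates a generic output-stationary systolic array cycle by cycle. Operands are skewed at the array edges and advance one processing element (PE) per cycle, and each PE accumulates one output element. The operand precisions `a_bits` and `b_bits` are call-time arguments, mirroring the idea of reconfiguring precision between layers without rebuilding the hardware. The dataflow choice, the names `fits` and `systolic_matmul`, and the range checks are assumptions, not the paper's design, and the "multi-channel" aspect is not modeled.

```python
def fits(value: int, bits: int) -> bool:
    """True if value is representable as a signed `bits`-bit integer."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def systolic_matmul(A: list[list[int]], B: list[list[int]],
                    a_bits: int, b_bits: int) -> tuple[list[list[int]], int]:
    """Cycle-level sketch of an output-stationary systolic array computing A @ B.

    PE (i, j) accumulates C[i][j] in place. Row i of A enters at the left edge
    delayed by i cycles and moves one PE right per cycle; column j of B enters
    at the top delayed by j cycles and moves one PE down per cycle, so
    A[i][s] and B[s][j] meet in PE (i, j) at cycle s + i + j.
    Returns the result matrix and the number of cycles simulated.
    """
    n, k, m = len(A), len(B), len(B[0])
    if any(len(row) != k for row in A):
        raise ValueError("inner dimensions do not match")
    # Precision is a runtime argument: operands must fit the configured widths.
    if not all(fits(v, a_bits) for row in A for v in row):
        raise ValueError(f"A has values outside the {a_bits}-bit signed range")
    if not all(fits(v, b_bits) for row in B for v in row):
        raise ValueError(f"B has values outside the {b_bits}-bit signed range")

    acc = [[0] * m for _ in range(n)]
    a_reg = [[None] * m for _ in range(n)]  # A operand held by each PE
    b_reg = [[None] * m for _ in range(n)]  # B operand held by each PE
    cycles = n + m + k - 2  # last pair reaches PE (n-1, m-1) at cycle n+m+k-3
    for t in range(cycles):
        # Every register latches its upstream neighbour's value (or the skewed edge input).
        new_a = [[None] * m for _ in range(n)]
        new_b = [[None] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    s = t - i
                    new_a[i][j] = A[i][s] if 0 <= s < k else None
                else:
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:
                    s = t - j
                    new_b[i][j] = B[s][j] if 0 <= s < k else None
                else:
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        # Each PE multiplies the pair it now holds and accumulates the result.
        for i in range(n):
            for j in range(m):
                a, b = a_reg[i][j], b_reg[i][j]
                if a is not None and b is not None:
                    # A bitwise PE would build this product from 1-bit ANDs and
                    # shifts; plain integer multiplication stands in for it here.
                    acc[i][j] += a * b
    return acc, cycles


if __name__ == "__main__":
    A = [[1, -2, 3], [4, 0, -1]]    # 2x3, fits 4-bit signed
    B = [[2, 1], [-3, 5], [0, -4]]  # 3x2, fits 4-bit signed
    C, cycles = systolic_matmul(A, B, a_bits=4, b_bits=4)
    print(C, cycles)  # [[8, -21], [8, 8]] 5
```

For an n×k by k×m product this schedule finishes in n + m + k - 2 cycles regardless of the configured precisions. In a real bitwise array the per-PE work, and therefore the cycle count or area, would also depend on a_bits and b_bits.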

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Inference & Quantization

Citation Metrics

Citations: 1
Influential citations: 0
References: 26
Year: 2025
Venue: IEEE International Symposium on Quality Electronic Design
