This paper introduces the Full-Space Quantization-driven Architecture (FQA) for hardware-efficient piecewise polynomial approximation (PPA) of nonlinear activation functions. By accounting for both fractional-bit truncation error and quantization error, FQA determines the optimal approximation coefficients and supports flexible hardware implementations tailored to different resource-performance trade-offs. Experimental results show that, for the Sigmoid function, FQA reduces area and power consumption by more than 50% compared with the state-of-the-art PPA architecture while achieving the optimal Maximum Absolute Error (MAE).
FQA cuts area and power consumption by over 50% for the Sigmoid activation function while maintaining optimal approximation accuracy.
In this paper, we propose a full-space quantization-driven architecture (FQA) for hardware-efficient piecewise polynomial approximation (PPA) of nonlinear activation functions. FQA comprehensively considers both the fractional-bit truncation error and the quantization error, each of which shifts the optimal approximation coefficients. Crucially, FQA can precisely determine the complete range of optimal coefficients and search over it. Based on the proposed FQA, we develop two distinct hardware implementation schemes to cater to different resource-performance trade-offs. Furthermore, we decouple all the fractional word lengths (FWLs) involved in the calculation process to enable the exploration of superior hardware architectures. To mitigate the increased software computation time caused by the expanded quantization space, we design an acceleration method named TBW (target-guided bisection window) to expedite the piecewise calculation and searching process. Experimental results demonstrate that, compared to existing architectures, FQA can significantly reduce the number of required segments while achieving the optimal Maximum Absolute Error (MAE). For the hardware design of the Sigmoid function, our approach achieves over 50% reduction in area and power consumption compared to the state-of-the-art PPA architecture. Finally, we present a complete design workflow for deploying PPA on configurable hardware, maximizing the utilization of existing hardware resources and minimizing MAE.
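To make the two error sources named in the abstract concrete, the following minimal Python sketch approximates Sigmoid on [0, 8) with first-order segments evaluated in fixed point. It is not the FQA coefficient search and does not implement TBW: the coefficients come from plain endpoint interpolation, and every function name (`build_segments`, `evaluate_fixed`, `max_abs_error`), the input range, and all word lengths are assumptions chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def quantize(value, fwl):
    """Round a real value to an integer holding `fwl` fractional bits."""
    return int(round(value * (1 << fwl)))

def truncate(value_int, fwl_from, fwl_to):
    """Change the fractional word length; dropping bits truncates (floors)."""
    shift = fwl_from - fwl_to
    return value_int >> shift if shift >= 0 else value_int << -shift

def build_segments(lo, hi, n_seg, fwl_a, fwl_b):
    """Fit a line through each segment's endpoints, then quantize slope and intercept.
    Assumption: uniform segments and endpoint interpolation, not an optimal search."""
    width = (hi - lo) / n_seg
    segs = []
    for i in range(n_seg):
        x0 = lo + i * width
        x1 = x0 + width
        a = (sigmoid(x1) - sigmoid(x0)) / width   # slope
        b = sigmoid(x0) - a * x0                  # intercept
        segs.append((quantize(a, fwl_a), quantize(b, fwl_b)))
    return segs

def evaluate_fixed(x_int, segs, lo, hi, fwl_x, fwl_a, fwl_b, fwl_out):
    """Evaluate a*x + b with integers only; x_int carries fwl_x fractional bits."""
    n_seg = len(segs)
    x = x_int / (1 << fwl_x)
    idx = min(int((x - lo) / (hi - lo) * n_seg), n_seg - 1)
    a_q, b_q = segs[idx]
    prod = a_q * x_int                                 # fwl_a + fwl_x fractional bits
    prod_t = truncate(prod, fwl_a + fwl_x, fwl_out)    # fractional-bit truncation
    b_t = truncate(b_q, fwl_b, fwl_out)
    return (prod_t + b_t) / (1 << fwl_out)

def max_abs_error(n_seg, lo=0, hi=8, fwl_x=8, fwl_a=12, fwl_b=12, fwl_out=8):
    """Maximum absolute error over every representable input in [lo, hi).
    lo and hi are assumed to be integers."""
    segs = build_segments(lo, hi, n_seg, fwl_a, fwl_b)
    worst = 0.0
    for x_int in range(lo << fwl_x, hi << fwl_x):
        x = x_int / (1 << fwl_x)
        y = evaluate_fixed(x_int, segs, lo, hi, fwl_x, fwl_a, fwl_b, fwl_out)
        worst = max(worst, abs(sigmoid(x) - y))
    return worst

if __name__ == "__main__":
    for n in (4, 16):
        print(n, "segments -> MAE", max_abs_error(n))
```

Each word length (`fwl_x`, `fwl_a`, `fwl_b`, `fwl_out`) is a separate parameter, loosely mirroring the idea of decoupled FWLs: changing any one of them alters the measured MAE for a fixed segment count. In this sketch the coefficients are quantized after fitting, so the resulting line is generally not optimal once truncation is applied, which is the kind of deviation the abstract says FQA accounts for.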