The paper introduces a precompute-reuse nibble multiplier architecture for low-power vector computing, addressing the area, power, and delay costs of conventional multipliers in AI acceleration. It decomposes operands into nibbles, precomputes scaled multiples of a broadcast operand using shift-add logic, and accumulates the results, avoiding wide lookup tables. RTL implementations synthesized in TSMC 28 nm show up to 1.69x area reduction and 1.63x power improvement over shift-add multipliers, and nearly 2.6x area and 2.7x power savings compared to LUT-based array multipliers at 128-bit scale.
Ditch power-hungry LUTs: a new nibble-based multiplier cuts area by nearly 2.6x and power by nearly 2.7x relative to LUT-based array multipliers in vector AI accelerators.
Vector multiplication is a fundamental operation for AI acceleration, responsible for over 85% of the computational load in convolution tasks. While essential, these operations are primary drivers of area, power, and delay in modern datapath designs. Conventional multiplier architectures often force a compromise between latency and complexity: high-speed array multipliers demand significant power, whereas sequential designs offer efficiency at the cost of throughput. This paper presents a precompute-reuse nibble multiplier architecture that bridges this gap by reformulating multiplication as a structured composition of reusable nibble-level precomputed values. The proposed design treats each operand as an independent low-precision element, decomposes it into fixed-width nibbles, and generates scaled multiples of a broadcast operand using compact shift-add logic. By replacing wide lookup tables and multiway multiplexers with logic-based precomputation and regular accumulation, the architecture decouples cycle complexity from gate delay. The design completes each 8-bit multiplication in two deterministic cycles with a short critical path, scales efficiently across vector lanes, and significantly reduces area and energy consumption. RTL implementations synthesized in TSMC 28 nm technology demonstrate up to 1.69x area reduction and 1.63x power improvement over shift-add multipliers, and nearly 2.6x area and 2.7x power savings compared to LUT-based array multipliers at 128-bit scale.
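The nibble decomposition described in the abstract can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the paper's RTL: it assumes unsigned 8-bit lane operands, and it assumes that "precompute-reuse" means the shifted copies of the broadcast operand (b, 2b, 4b, 8b) are formed once and shared by every lane. The function names (`shifted_multiples`, `nibble_times_b`, `nibble_multiply`, `vector_multiply`) are hypothetical and chosen for illustration.

```python
def shifted_multiples(b):
    """Form b, 2b, 4b, 8b once; assumed to be shared by every vector lane."""
    return [b << i for i in range(4)]


def nibble_times_b(nibble, shifted):
    """Shift-add: sum the shifted copies of b selected by the nibble's set bits."""
    total = 0
    for i in range(4):
        if (nibble >> i) & 1:
            total += shifted[i]
    return total


def nibble_multiply(a, shifted):
    """Multiply one unsigned 8-bit lane operand by b in two nibble steps."""
    lo = a & 0xF          # low nibble of the lane operand
    hi = (a >> 4) & 0xF   # high nibble of the lane operand
    acc = nibble_times_b(lo, shifted)         # step 1: low-nibble partial product
    acc += nibble_times_b(hi, shifted) << 4   # step 2: high-nibble partial, weighted by 16
    return acc


def vector_multiply(lanes, b):
    """Multiply every lane operand by the same broadcast operand b."""
    shifted = shifted_multiples(b)  # precomputed once, reused across all lanes
    return [nibble_multiply(a, shifted) for a in lanes]


if __name__ == "__main__":
    print(vector_multiply([3, 16, 255], 200))
```

The two accumulation steps in `nibble_multiply` loosely mirror the two deterministic cycles per 8-bit multiplication mentioned in the abstract; how the actual hardware schedules and shares this logic is not specified here.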