This paper introduces TeraPool, a scaled-up cluster design in which over 1000 RISC-V processing elements (PEs) share a multi-megabyte L1 memory, addressing the interconnect complexity of large shared-memory architectures. The authors use a low-latency hierarchical interconnect to keep the cores-to-L1-memory connection physically implementable, with memory bank accesses costing 9-13.5 pJ. Implemented in 12 nm FinFET, TeraPool reaches 1.89 TFLOP/s peak single-precision performance and up to 200 GFLOP/s/W energy efficiency, demonstrating the viability of scaling shared-L1 clusters to a thousand PEs.
Forget scaling *out* -- TeraPool scales *up* to 1024 RISC-V cores sharing L1 memory, reaching 1.89 TFLOP/s peak performance and up to 200 GFLOP/s/W thanks to its hierarchical interconnect.
Shared L1-memory clusters of streamlined instruction processors (processing elements, PEs) are commonly used as building blocks in modern, massively parallel computing architectures (e.g., GP-GPUs). *Scaling out* these architectures by increasing the number of clusters incurs computational and power overhead, caused by the need to split and merge large data structures in chunks and to move those chunks across memory hierarchies via the high-latency global interconnect. *Scaling up* the cluster reduces buffering, copy, and synchronization overheads. However, the complexity of a fully connected cores-to-L1-memory crossbar grows quadratically with PE count, posing a major physical implementation challenge. We present TeraPool, a physically implementable scaled-up cluster design with >1000 floating-point-capable RISC-V PEs sharing a multi-megabyte, >4000-banked L1 memory via a low-latency hierarchical interconnect (1-7/9/11 cycles, depending on target frequency). Implemented in 12 nm FinFET technology, TeraPool achieves near-gigahertz frequencies (910 MHz at typical conditions, 0.80 V/25 °C).
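The quadratic growth described above can be made concrete with a rough count. The sketch below is illustrative only and is not the paper's cost model: it assumes a flat crossbar needs one crosspoint per (PE, bank) pair, and it uses a banking factor of 4 banks per PE, suggested by the >4000 banks for >1000 PEs quoted above. The function name `flat_crossbar_crosspoints` and the specific PE counts (256, 512, 1024) are assumptions chosen for illustration.

```python
def flat_crossbar_crosspoints(num_pes: int, banks_per_pe: int = 4) -> int:
    """Crosspoints in a fully connected PE-to-bank crossbar.

    Assumption: one crosspoint per (PE, bank) pair, with the bank count
    scaling linearly with the PE count (banks_per_pe is hypothetical).
    """
    num_banks = num_pes * banks_per_pe
    return num_pes * num_banks


if __name__ == "__main__":
    baseline = flat_crossbar_crosspoints(256)
    for num_pes in (256, 512, 1024):
        count = flat_crossbar_crosspoints(num_pes)
        # Report the absolute count and the growth relative to 256 PEs.
        print(f"{num_pes:5d} PEs: {count:>9,d} crosspoints "
              f"({count / baseline:.0f}x the 256-PE crossbar)")
```

Under these assumptions, quadrupling the PE count from 256 to 1024 multiplies the crosspoint count by 16, which is the scaling pressure a hierarchical interconnect is meant to relieve.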
The energy-efficient hierarchical PE-to-L1-memory interconnect consumes only 9-13.5 pJ per memory bank access, just 0.74-1.1× the cost of an FP32 FMA. A high-bandwidth main memory link is designed to manage data transfers in and out of the shared L1, sustaining transfers at the full bandwidth of an HBM2E main memory. At 910 MHz, the cluster delivers up to 1.89 single-precision TFLOP/s peak performance and up to 200 GFLOP/s/W energy efficiency (at a high IPC/PE of 0.8 on average) in benchmark kernels, demonstrating the feasibility of scaling a shared-L1 cluster to a thousand PEs, four times the PE count of the largest clusters reported in the literature.
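As a quick consistency check on the energy figures above, the two ends of the access-energy range and the two ends of the relative-cost range should imply roughly the same FP32 FMA energy. The sketch below is back-of-the-envelope arithmetic, not a number reported by the authors. It assumes the low ends (9 pJ, 0.74×) and the high ends (13.5 pJ, 1.1×) pair up, and the helper name `implied_fma_energy_pj` is hypothetical.

```python
def implied_fma_energy_pj(access_energy_pj: float, ratio_to_fma: float) -> float:
    """FMA energy implied by an access energy and its cost relative to one FMA."""
    return access_energy_pj / ratio_to_fma


# Assumption: low ends of both ranges belong together, as do the high ends.
low = implied_fma_energy_pj(9.0, 0.74)
high = implied_fma_energy_pj(13.5, 1.1)

print(f"implied FP32 FMA energy: {low:.1f} pJ (low end), {high:.1f} pJ (high end)")
```

Both pairings land near 12 pJ, so the two quoted ranges are mutually consistent under this pairing assumption.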