The paper introduces Canon, a parallel architecture that aims to bridge the gap between specialized accelerators and general-purpose architectures by exploiting data-level and instruction-level parallelism. Canon utilizes programmable Finite State Machines (FSMs) for dynamic data-driven orchestration, translating runtime meta-information into control instructions to encode high-level dataflow. The architecture also employs time-lapsed SIMD execution, issuing instructions across processing elements over multiple cycles to create a staggered pipelined execution and amortize control overhead.
Achieve specialized accelerator performance with general-purpose architecture flexibility using Canon, which dynamically orchestrates dataflow via programmable FSMs and time-lapsed SIMD execution.
Domain-specific accelerators deliver exceptional performance on their target workloads through fabrication-time orchestrated datapaths. However, such specialized architectures often exhibit performance fragility when exposed to new kernels or irregular input patterns. In contrast, programmable architectures like FPGAs, CGRAs, and GPUs rely on compile-time orchestration to support a broader range of applications, but they are typically less efficient under irregular or sparse data. Pushing the boundaries of programmable architectures requires designs that can achieve efficiency and high performance on par with specialized accelerators while retaining the agility of general-purpose architectures. We introduce Canon, a parallel architecture that bridges the gap between specialized and general-purpose architectures. Canon exploits data-level and instruction-level parallelism through two mechanisms. First, it employs a novel dynamic data-driven orchestration mechanism using programmable Finite State Machines (FSMs). These FSMs are programmed at compile time to encode high-level dataflow per state, and at runtime they translate incoming meta-information (e.g., sparse coordinates) into control instructions. Second, Canon introduces time-lapsed SIMD execution, in which instructions are issued across a row of processing elements over several cycles, creating a staggered pipelined execution. These innovations amortize control overhead, allowing dynamic instruction changes while constructing a continuously evolving dataflow that maximizes parallelism. Experimental evaluation shows that Canon delivers high performance across diverse data-agnostic and data-driven kernels while achieving efficiency comparable to specialized accelerators, yet retaining the flexibility of a general-purpose architecture.
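The two mechanisms described in the abstract can be illustrated with a small Python sketch. It is a toy model written for this summary, not Canon's actual design: the FSM states, the metadata events, the instruction names (MAC, FLUSH_THEN_MAC), and the one-cycle skew between neighboring processing elements are all assumptions chosen for illustration. The first function plays the role of an FSM that turns a stream of sparse row coordinates into control instructions; the second staggers each instruction across a row of processing elements.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical FSM table: (state, metadata event) -> (next state, instruction).
# States, events, and instruction names are invented for this sketch.
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("idle", "new_row"): ("accumulate", "MAC"),
    ("accumulate", "same_row"): ("accumulate", "MAC"),
    ("accumulate", "new_row"): ("accumulate", "FLUSH_THEN_MAC"),
}


def orchestrate(row_coords: List[int]) -> List[str]:
    """Translate sparse row coordinates (runtime metadata) into instructions."""
    state = "idle"
    previous_row: Optional[int] = None
    instructions: List[str] = []
    for row in row_coords:
        # The event depends only on the incoming coordinate stream.
        event = "same_row" if row == previous_row else "new_row"
        state, instruction = TRANSITIONS[(state, event)]
        instructions.append(instruction)
        previous_row = row
    return instructions


def time_lapsed_schedule(
    instructions: List[str], num_pes: int
) -> Dict[int, List[Tuple[int, str]]]:
    """Map each cycle to the (pe_index, instruction) pairs issued in it.

    Assumption: instruction i reaches processing element j at cycle i + j,
    i.e. a one-cycle skew per element along the row.
    """
    schedule: Dict[int, List[Tuple[int, str]]] = {}
    for i, instruction in enumerate(instructions):
        for pe in range(num_pes):
            schedule.setdefault(i + pe, []).append((pe, instruction))
    return schedule


if __name__ == "__main__":
    program = orchestrate([0, 0, 2])
    for cycle, issued in sorted(time_lapsed_schedule(program, num_pes=2).items()):
        print(cycle, issued)
```

Under these assumptions, a stream of n instructions over a row of p processing elements occupies n + p - 1 cycles, and in the middle cycles different elements execute different instructions at once. That is one way to read how a single issued instruction stream could be shared along a row while still allowing the instruction to change from cycle to cycle.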