DUET is a disaggregated hardware accelerator architecture that assigns the prefill and decode phases of hybrid Mamba-Transformer LLMs to specialized hardware packages. The Prefill package uses systolic arrays for efficient matrix multiplications and long-sequence SSMs, while the Decode package uses vector-unit arrays with high-bandwidth memory to accelerate token-by-token SSMs and vector-matrix multiplications. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens compared to the B200 GPU.
Hybrid Mamba-Transformer LLMs get a 4x speed boost in time-to-first-token and 1.4x higher throughput thanks to a new disaggregated accelerator architecture tailored to prefill and decode phases.
Large language model inference proceeds in two distinct phases: a compute-bound prefill followed by a memory-bandwidth-bound decode. Hybrid Mamba-Transformer models inherit this asymmetry while adding state space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, showing that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns the prefill and decode phases to specialized packages. The Prefill package uses systolic array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package uses vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSMs and vector-matrix multiplications. Both architectures are runtime-configurable to support hybrid models with mixed Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens than the B200 GPU.
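The prefill/decode asymmetry described above can be illustrated with a back-of-the-envelope arithmetic-intensity estimate. The sketch below is not taken from the paper: the function name, the 4096-wide layer, the 4096-token prompt, and the 2-byte element size are all hypothetical values chosen for illustration, and the model ignores caching, batching, and KV-cache traffic. It only counts FLOPs and bytes for a single dense projection.

```python
def matmul_arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                                bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) product.

    Assumes each operand and the result cross the memory interface exactly
    once (a simplifying assumption, not a measured figure).
    """
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (
        tokens * d_in      # input activations
        + d_in * d_out     # weight matrix
        + tokens * d_out   # output activations
    )
    return flops / bytes_moved


# Hypothetical layer and prompt sizes, for illustration only.
prefill = matmul_arithmetic_intensity(tokens=4096, d_in=4096, d_out=4096)
decode = matmul_arithmetic_intensity(tokens=1, d_in=4096, d_out=4096)

print(f"prefill: {prefill:.1f} FLOPs/byte")
print(f"decode:  {decode:.3f} FLOPs/byte")
```

Under these assumptions the prefill projection performs on the order of a thousand FLOPs per byte, while the single-token decode projection performs roughly one, because the whole weight matrix is read to produce one output row. That gap is the usual reason prefill is treated as compute-bound and decode as memory-bandwidth-bound.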