Mar 16, 2026 · arXiv:2603.15530

DUET: Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages

Alish Kanani, Sangwan Lee, Han Lyu, Jiahao Lin, Jaehyun Park, Umit Y. Ogras

AI Summary

DUET is a disaggregated hardware accelerator architecture that assigns the prefill and decode phases of hybrid Mamba-Transformer LLMs to specialized hardware packages. The Prefill package uses systolic arrays for efficient matrix multiplications and long-sequence SSMs, while the Decode package uses vector-unit arrays with high-bandwidth memory to accelerate token-by-token SSMs and vector-matrix multiplications. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens compared to the B200 GPU.
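To make the "token-by-token SSM" workload concrete, the following is a minimal sketch of a linear state space recurrence, not the paper's implementation. It assumes a scalar input per step and fixed diagonal parameters `a`, `b`, `c` (hypothetical names); Mamba layers use input-dependent parameters, and practical prefill kernels typically replace the sequential loop with a parallel scan. The point is only that a decode step is a small element-wise state update rather than a large matrix multiplication.

```python
from typing import List, Tuple


def ssm_step(h: List[float], x_t: float,
             a: List[float], b: List[float], c: List[float]) -> Tuple[List[float], float]:
    """One decode step: h_t = a * h_{t-1} + b * x_t (element-wise), y_t = sum(c * h_t)."""
    new_h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
    y_t = sum(c_i * h_i for c_i, h_i in zip(c, new_h))
    return new_h, y_t


def ssm_prefill(xs: List[float],
                a: List[float], b: List[float], c: List[float]) -> Tuple[List[float], List[float]]:
    """Process a whole prompt by applying the same step to every token in order."""
    h = [0.0] * len(a)  # the state starts at zero
    ys = []
    for x_t in xs:
        h, y_t = ssm_step(h, x_t, a, b, c)
        ys.append(y_t)
    return h, ys


# Illustrative parameters only (assumed values, state size 2).
a, b, c = [0.5, 0.25], [1.0, 1.0], [1.0, 2.0]

# Prefill over a 3-token prompt, then decode one more token from the saved state.
state, prompt_outputs = ssm_prefill([1.0, 2.0, 3.0], a, b, c)
state_after_decode, next_output = ssm_step(state, 4.0, a, b, c)
```

Only the fixed-size state `h` is carried from prefill into decode, which is why each decode step costs the same regardless of how long the prompt was.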

Key Contribution

A disaggregated accelerator architecture with separate prefill and decode packages gives hybrid Mamba-Transformer LLMs 4x faster time to first token and 1.4x higher throughput than the B200 GPU.

Abstract

Large language models operate in two distinct phases: a compute-bound prefill followed by a memory-bandwidth-bound decode. Hybrid Mamba-Transformer models inherit this asymmetry while adding state space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, showing that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns prefill and decode phases to specialized packages. The Prefill package utilizes systolic array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package utilizes vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM and vector-matrix multiplications. Both architectures are runtime-configurable to support hybrid models with mixed Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens over the B200 GPU.
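The compute-bound versus bandwidth-bound split can be illustrated with a back-of-the-envelope arithmetic intensity estimate. The sketch below is not from the paper: the hidden size, prompt length, and 2-byte element size are assumed values, and the model counts each operand as moved to or from memory exactly once, ignoring caching and tiling effects.

```python
def matmul_arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                                bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) product."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per MAC
    bytes_moved = bytes_per_elem * (
        tokens * d_in      # read the activations
        + d_in * d_out     # read the weights
        + tokens * d_out   # write the outputs
    )
    return flops / bytes_moved


D = 4096  # hypothetical hidden size, chosen for illustration

# Prefill: the whole prompt (assumed 2048 tokens) shares one pass over the weights.
prefill_intensity = matmul_arithmetic_intensity(tokens=2048, d_in=D, d_out=D)

# Decode: a single token per step, so every weight is read for one MAC.
decode_intensity = matmul_arithmetic_intensity(tokens=1, d_in=D, d_out=D)

print(f"prefill: {prefill_intensity:.1f} FLOPs/byte")
print(f"decode:  {decode_intensity:.3f} FLOPs/byte")
```

Under these assumptions the prefill product performs about a thousand times more arithmetic per byte than the single-token decode product, consistent with the abstract's pairing of matmul-oriented hardware for prefill and high-bandwidth memory for decode.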

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
