For each attention head $i$, we slice the query, key, and value tensors along the head dimension to obtain 2D tensors:
$$\bm{\mathsfit{Q}}^{\text{NoPE}}_{:,i,:}:=\bm{\mathsfit{Q}}^{\text{NoPE}}\left[:,\,i,\,:\right],\quad \bm{\mathsfit{Q}}^{\text{RoPE}}_{:,i,:}:=\bm{\mathsfit{Q}}^{\text{RoPE}}\left[:,\,i,\,:\right],\quad \bm{\mathsfit{K}}^{\text{NoPE}}_{:,i,:}:=\bm{\mathsfit{K}}^{\text{NoPE}}\left[:,\,i,\,:\right],\quad \bm{\mathsfit{V}}_{:,i,:}:=\bm{\mathsfit{V}}\left[:,\,i,\,:\right].$$
To incorporate positional information, MLA shares a common RoPE key $\bm{K}^{\text{RoPE}}$ across all attention heads. The final position-aware query and key for head $i$ are formed by concatenating their respective NoPE and RoPE components:
$$\bm{\mathsfit{Q}}_{:,i,:}=\operatorname{Concat}\left(\left[\bm{\mathsfit{Q}}_{:,i,:}^{\text{NoPE}},\,\bm{\mathsfit{Q}}_{:,i,:}^{\text{RoPE}}\right],\,\text{dim}=1\right),\quad\bm{\mathsfit{K}}_{:,i,:}=\operatorname{Concat}\left(\left[\bm{\mathsfit{K}}_{:,i,:}^{\text{NoPE}},\,\bm{K}^{\text{RoPE}}\right],\,\text{dim}=1\right).$$

Efficient Decoding. Recall that MLA utilizes the up-projection matrices $\bm{W}^{\text{UK}}$ and $\bm{W}^{\text{UV}}$. We extract the $d_{h}$-column slices for head $i$ to define its head-specific projections:
$$\bm{W}_{:,(i)}^{\text{UK}}:=\bm{W}^{\text{UK}}\left[:,\,id_{h}:(i+1)d_{h}\right],\quad\bm{W}_{:,(i)}^{\text{UV}}:=\bm{W}^{\text{UV}}\left[:,\,id_{h}:(i+1)d_{h}\right].$$
Following the DeepSeek official inference implementation (DeepSeek et al., 2024c), we illustrate how the up-projection matrices can be "absorbed" into the queries to avoid explicit KV materialization in MLA decoding. For the prefix sequence $\{0,\ldots,n-1\}$ with cached components $\bm{C}^{\text{KV}}$ and $\bm{K}^{\text{RoPE}}$, we define the head-wise up-projections by partitioning $\bm{W}^{\text{UK}}$ and $\bm{W}^{\text{UV}}$ into $h$ heads, $\{\bm{W}_{:,(i)}^{\text{UK}}\}_{i=0}^{h-1}$ and $\{\bm{W}_{:,(i)}^{\text{UV}}\}_{i=0}^{h-1}$. For the last prefix token at position $n-1$, let $\bm{\mathsfit{Q}}_{n-1,i,:}$ denote the query vector for the $i$-th attention head. To maintain variance during the dot-product operation, we apply the scaling factor $\tau=\frac{1}{\sqrt{d_{h}+d_{h}^{R}}}$ and compute the attention output for the token at position $n-1$ as follows:
$$\begin{split}\bm{\mathsfit{O}}_{n-1,i,:}&=\operatorname{Softmax}\left(\tau\bm{\mathsfit{Q}}_{n-1,i,:}^{\text{NoPE}}\left(\bm{C}^{\text{KV}}\bm{W}_{:,(i)}^{\text{UK}}\right)^{\top}+\tau\bm{\mathsfit{Q}}_{n-1,i,:}^{\text{RoPE}}\left(\bm{K}^{\text{RoPE}}\right)^{\top}\right)\left(\bm{C}^{\text{KV}}\bm{W}_{:,(i)}^{\text{UV}}\right)\\&=\operatorname{Softmax}\left(\tau\underbrace{\bm{\mathsfit{Q}}_{n-1,i,:}^{\text{NoPE}}\left(\bm{W}_{:,(i)}^{\text{UK}}\right)^{\top}}_{\tilde{\bm{\mathsfit{Q}}}_{n-1,i,:}^{\text{NoPE}}\in\mathbb{R}^{d_{c}}}\left(\bm{C}^{\text{KV}}\right)^{\top}+\tau\bm{\mathsfit{Q}}_{n-1,i,:}^{\text{RoPE}}\left(\bm{K}^{\text{RoPE}}\right)^{\top}\right)\bm{C}^{\text{KV}}\bm{W}_{:,(i)}^{\text{UV}},\end{split}$$
where $\bm{\mathsfit{O}}_{n-1,i,:}$ is the attention output for head $i$ at position $n-1$.
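To make the absorption concrete, the following is a minimal PyTorch-style sketch of the computation above for a single head $i$. The variable names and shapes are illustrative assumptions, not the DeepSeek implementation: `c_kv` stands for the cached latent $\bm{C}^{\text{KV}}$ of shape $(n, d_c)$, `k_rope` for the shared $\bm{K}^{\text{RoPE}}$ of shape $(n, d_h^{R})$, `q_nope` and `q_rope` for the head-$i$ query components of the last token, and `w_uk_i`, `w_uv_i` for the $d_h$-column slices of $\bm{W}^{\text{UK}}$ and $\bm{W}^{\text{UV}}$.

```python
import torch

def absorbed_attention_head(q_nope, q_rope, c_kv, k_rope, w_uk_i, w_uv_i):
    """Weight-absorbed MLA decoding for one head at the last prefix position.

    Assumed (illustrative) shapes:
      q_nope: (d_h,)      q_rope: (d_h^R,)
      c_kv:   (n, d_c)    k_rope: (n, d_h^R)
      w_uk_i: (d_c, d_h)  w_uv_i: (d_c, d_h)
    """
    d_h, d_h_rope = q_nope.shape[-1], q_rope.shape[-1]
    tau = 1.0 / (d_h + d_h_rope) ** 0.5        # scaling factor 1 / sqrt(d_h + d_h^R)

    # Absorb W^UK into the query: q_tilde = q_nope (W^UK_i)^T lives in the
    # latent dimension d_c, so per-head NoPE keys are never materialized.
    q_tilde = q_nope @ w_uk_i.T                # (d_c,)

    # Attention logits over the n cached prefix tokens.
    logits = tau * (q_tilde @ c_kv.T + q_rope @ k_rope.T)   # (n,)
    probs = torch.softmax(logits, dim=-1)

    # Aggregate in latent space first, then up-project once with W^UV_i.
    return (probs @ c_kv) @ w_uv_i             # (d_h,)
```

Note that the softmax and the value aggregation both operate directly on the latent cache, so no per-head key or value of shape $(n, d_h)$ is ever formed.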
We next present a three-step algorithm that leverages the associativity of matrix multiplication to avoid materializing the $h$ heads of NoPE keys and values, thereby improving decoding efficiency.

Step 1 (Query-Side Weight Absorption). We first reorganize the NoPE key and value up-projection matrices into head-wise tensors:
$$\tilde{\bm{\mathsfit{W}}}^{\text{UK}}_{i,:,:}:=\left(\bm{W}^{\text{UK}}_{:,(i)}\right)^{\top}\in\mathbb{R}^{d_{h}\times d_{c}},\qquad\tilde{\bm{\mathsfit{W}}}^{\text{UV}}_{i,:,:}:=\bm{W}^{\text{UV}}_{:,(i)}\in\mathbb{R}^{d_{c}\times d_{h}},$$
where $\tilde{\bm{\mathsfit{W}}}^{\text{UK}}\in\mathbb{R}^{h\times d_{h}\times d_{c}}$ and $\tilde{\bm{\mathsfit{W}}}^{\text{UV}}\in\mathbb{R}^{h\times d_{c}\times d_{h}}$. For the NoPE query at position $n-1$, $\bm{\mathsfit{Q}}_{n-1,:,:}^{\text{NoPE}}$, we absorb the up-projection weight tensor $\tilde{\bm{\mathsfit{W}}}^{\text{UK}}$ directly into the query via Einstein summation:
$$\tilde{\bm{\mathsfit{Q}}}_{n-1,:,:}^{\text{NoPE}}=\operatorname{einsum}\left(\texttt{"hp,hpc->hc"},\,\bm{\mathsfit{Q}}_{n-1,:,:}^{\text{NoPE}},\,\tilde{\bm{\mathsfit{W}}}^{\text{UK}}\right),\quad p=d_{h},\ c=d_{c},\quad\tilde{\bm{\mathsfit{Q}}}_{n-1,:,:}^{\text{NoPE}}\in\mathbb{R}^{h\times d_{c}}.$$

Step 2 (MQA-Style Decoding on Latent KV Cache). Given the KV cache $\bm{C}^{\text{KV}}$ and $\bm{K}^{\text{RoPE}}$, we define the shared key and value tensors by concatenating and reshaping the latent representations as
$$\tilde{\bm{\mathsfit{K}}}=\operatorname{Reshape}\left(\operatorname{Concat}\left(\left[\bm{C}^{\text{KV}},\,\bm{K}^{\text{RoPE}}\right],\,\text{dim}=1\right),\,\left[n,\,1,\,d_{c}+d_{h}^{R}\right]\right)\in\mathbb{R}^{n\times 1\times(d_{c}+d_{h}^{R})}$$
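As a companion to the two steps above, here is a hedged sketch, under assumed shapes and variable names, of Step 1's einsum absorption for all $h$ heads at once and the resulting MQA-style score computation against the shared latent cache. It mirrors the `"hp,hpc->hc"` contraction in the text but is only an illustrative reading of it, not the reference implementation.

```python
import torch

def mqa_style_scores(q_nope, q_rope, w_uk_tilde, c_kv, k_rope):
    """Steps 1-2 sketched for all h heads of the last token at once.

    Assumed (illustrative) shapes:
      q_nope:     (h, d_h)        NoPE query components
      q_rope:     (h, d_h^R)      RoPE query components
      w_uk_tilde: (h, d_h, d_c)   head-wise W~^UK from Step 1
      c_kv:       (n, d_c)        latent KV cache C^KV
      k_rope:     (n, d_h^R)      shared RoPE key K^RoPE
    """
    d_h, d_h_rope = q_nope.shape[-1], q_rope.shape[-1]
    tau = 1.0 / (d_h + d_h_rope) ** 0.5

    # Step 1: absorb the key up-projection into the query ("hp,hpc->hc").
    q_tilde = torch.einsum("hp,hpc->hc", q_nope, w_uk_tilde)   # (h, d_c)

    # Step 2: every head attends to the same latent keys, exactly as in MQA,
    # where a single shared key serves all query heads.
    scores = tau * (q_tilde @ c_kv.T + q_rope @ k_rope.T)      # (h, n)
    return torch.softmax(scores, dim=-1)
```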