For each attention head $i$, we slice the query, key, and value tensors along the head dimension to obtain 2D tensors:
$$\bm{\mathsfit{Q}}^{\text{NoPE}}_{:,i,:}:=\bm{\mathsfit{Q}}^{\text{NoPE}}\left[:,\,i,\,:\right],\quad \bm{\mathsfit{Q}}^{\text{RoPE}}_{:,i,:}:=\bm{\mathsfit{Q}}^{\text{RoPE}}\left[:,\,i,\,:\right],\quad \bm{\mathsfit{K}}^{\text{NoPE}}_{:,i,:}:=\bm{\mathsfit{K}}^{\text{NoPE}}\left[:,\,i,\,:\right],\quad \bm{\mathsfit{V}}_{:,i,:}:=\bm{\mathsfit{V}}\left[:,\,i,\,:\right].$$
To incorporate positional information, MLA shares a common RoPE key $\bm{K}^{\text{RoPE}}$ across all attention heads. The final position-aware query and key for head $i$ are formed by concatenating their respective NoPE and RoPE components:
$$\bm{\mathsfit{Q}}_{:,i,:}=\operatorname{Concat}\left(\left[\bm{\mathsfit{Q}}_{:,i,:}^{\text{NoPE}},\,\bm{\mathsfit{Q}}_{:,i,:}^{\text{RoPE}}\right],\,\text{dim}=1\right),\quad\bm{\mathsfit{K}}_{:,i,:}=\operatorname{Concat}\left(\left[\bm{\mathsfit{K}}_{:,i,:}^{\text{NoPE}},\,\bm{K}^{\text{RoPE}}\right],\,\text{dim}=1\right).$$

Efficient Decoding. Recall that MLA utilizes the up-projection matrices $\bm{W}^{\text{UK}}$ and $\bm{W}^{\text{UV}}$. We extract the $d_{h}$-column slices for head $i$ to define its head-specific projections:
$$\bm{W}_{:,(i)}^{\text{UK}}:=\bm{W}^{\text{UK}}\left[:,\,id_{h}:(i+1)d_{h}\right],\quad\bm{W}_{:,(i)}^{\text{UV}}:=\bm{W}^{\text{UV}}\left[:,\,id_{h}:(i+1)d_{h}\right].$$
Following the DeepSeek official inference implementation (DeepSeek et al., 2024c), we illustrate how the up-projection matrices can be "absorbed" into the queries to avoid explicit KV materialization in MLA decoding. For the prefix sequence $\{0,\ldots,n-1\}$ with cached components $\bm{C}^{\text{KV}}$ and $\bm{K}^{\text{RoPE}}$, we define the head-wise up-projections by partitioning $\bm{W}^{\text{UK}}$ and $\bm{W}^{\text{UV}}$ into $h$ heads, $\{\bm{W}_{:,(i)}^{\text{UK}}\}_{i=0}^{h-1}$ and $\{\bm{W}_{:,(i)}^{\text{UV}}\}_{i=0}^{h-1}$. For the last prefix token at position $n-1$, let $\bm{\mathsfit{Q}}_{n-1,i,:}$ denote the query vector for the $i$-th attention head. To maintain variance during the dot-product operation, we apply the scaling factor $\tau=\frac{1}{\sqrt{d_{h}+d_{h}^{R}}}$ and compute the attention output for the token at position $n-1$ as follows:
$$\begin{split}\bm{\mathsfit{O}}_{n-1,i,:}&=\operatorname{Softmax}\left(\tau\bm{\mathsfit{Q}}_{n-1,i,:}^{\text{NoPE}}\left(\bm{C}^{\text{KV}}\bm{W}_{:,(i)}^{\text{UK}}\right)^{\top}+\tau\bm{\mathsfit{Q}}_{n-1,i,:}^{\text{RoPE}}\left(\bm{K}^{\text{RoPE}}\right)^{\top}\right)\left(\bm{C}^{\text{KV}}\bm{W}_{:,(i)}^{\text{UV}}\right)\\&=\operatorname{Softmax}\left(\tau\underbrace{\bm{\mathsfit{Q}}_{n-1,i,:}^{\text{NoPE}}\left(\bm{W}_{:,(i)}^{\text{UK}}\right)^{\top}}_{\tilde{\bm{\mathsfit{Q}}}_{n-1,i,:}^{\text{NoPE}}\in\mathbb{R}^{d_{c}}}\left(\bm{C}^{\text{KV}}\right)^{\top}+\tau\bm{\mathsfit{Q}}_{n-1,i,:}^{\text{RoPE}}\left(\bm{K}^{\text{RoPE}}\right)^{\top}\right)\bm{C}^{\text{KV}}\bm{W}_{:,(i)}^{\text{UV}},\end{split}$$
where $\bm{\mathsfit{O}}_{n-1,i,:}$ is the attention output for head $i$ at position $n-1$.
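To make the absorption concrete, the following is a minimal PyTorch-style sketch of the computation above for a single head $i$. The variable names and shapes are illustrative assumptions, not the DeepSeek implementation: `c_kv` stands for the cached latent $\bm{C}^{\text{KV}}$ of shape $(n, d_c)$, `k_rope` for the shared $\bm{K}^{\text{RoPE}}$ of shape $(n, d_h^{R})$, `q_nope` and `q_rope` for the head-$i$ query components of the last token, and `w_uk_i`, `w_uv_i` for the $d_h$-column slices of $\bm{W}^{\text{UK}}$ and $\bm{W}^{\text{UV}}$.

```python
import torch

def absorbed_attention_head(q_nope, q_rope, c_kv, k_rope, w_uk_i, w_uv_i):
    """Weight-absorbed MLA decoding for one head at the last prefix position.

    Assumed (illustrative) shapes:
      q_nope: (d_h,)      q_rope: (d_h^R,)
      c_kv:   (n, d_c)    k_rope: (n, d_h^R)
      w_uk_i: (d_c, d_h)  w_uv_i: (d_c, d_h)
    """
    d_h, d_h_rope = q_nope.shape[-1], q_rope.shape[-1]
    tau = 1.0 / (d_h + d_h_rope) ** 0.5        # scaling factor 1 / sqrt(d_h + d_h^R)

    # Absorb W^UK into the query: q_tilde = q_nope (W^UK_i)^T lives in the
    # latent dimension d_c, so per-head NoPE keys are never materialized.
    q_tilde = q_nope @ w_uk_i.T                # (d_c,)

    # Attention logits over the n cached prefix tokens.
    logits = tau * (q_tilde @ c_kv.T + q_rope @ k_rope.T)   # (n,)
    probs = torch.softmax(logits, dim=-1)

    # Aggregate in latent space first, then up-project once with W^UV_i.
    return (probs @ c_kv) @ w_uv_i             # (d_h,)
```

Note that the softmax and the value aggregation both operate directly on the latent cache, so no per-head key or value of shape $(n, d_h)$ is ever formed.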
We next present a three-step algorithm that leverages the associativity of matrix multiplication to avoid materializing the $h$ heads of NoPE keys and values, thereby improving decoding efficiency.

Step 1 (Query-Side Weight Absorption). We first reorganize the NoPE key and value up-projection matrices into head-wise tensors:
$$\tilde{\bm{\mathsfit{W}}}^{\text{UK}}_{i,:,:}:=\left(\bm{W}^{\text{UK}}_{:,(i)}\right)^{\top}\in\mathbb{R}^{d_{h}\times d_{c}},\qquad\tilde{\bm{\mathsfit{W}}}^{\text{UV}}_{i,:,:}:=\bm{W}^{\text{UV}}_{:,(i)}\in\mathbb{R}^{d_{c}\times d_{h}},$$
where $\tilde{\bm{\mathsfit{W}}}^{\text{UK}}\in\mathbb{R}^{h\times d_{h}\times d_{c}}$ and $\tilde{\bm{\mathsfit{W}}}^{\text{UV}}\in\mathbb{R}^{h\times d_{c}\times d_{h}}$. For the NoPE query at position $n-1$, $\bm{\mathsfit{Q}}_{n-1,:,:}^{\text{NoPE}}$, we absorb the up-projection weight tensor $\tilde{\bm{\mathsfit{W}}}^{\text{UK}}$ directly into the query via Einstein summation:
$$\tilde{\bm{\mathsfit{Q}}}_{n-1,:,:}^{\text{NoPE}}=\operatorname{einsum}\left(\texttt{"hp,hpc->hc"},\,\bm{\mathsfit{Q}}_{n-1,:,:}^{\text{NoPE}},\,\tilde{\bm{\mathsfit{W}}}^{\text{UK}}\right),\quad p=d_{h},\ c=d_{c},\quad\tilde{\bm{\mathsfit{Q}}}_{n-1,:,:}^{\text{NoPE}}\in\mathbb{R}^{h\times d_{c}}.$$

Step 2 (MQA-Style Decoding on Latent KV Cache). Given the KV cache $\bm{C}^{\text{KV}}$ and $\bm{K}^{\text{RoPE}}$, we define the shared key and value tensors by concatenating and reshaping the latent representations as
$$\tilde{\bm{\mathsfit{K}}}=\operatorname{Reshape}\left(\operatorname{Concat}\left(\left[\bm{C}^{\text{KV}},\,\bm{K}^{\text{RoPE}}\right],\,\text{dim}=1\right),\,\left[n,\,1,\,d_{c}+d_{h}^{R}\right]\right)\in\mathbb{R}^{n\times 1\times(d_{c}+d_{h}^{R})}$$
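As a companion to the two steps above, here is a hedged sketch, under assumed shapes and variable names, of Step 1's einsum absorption for all $h$ heads at once and the resulting MQA-style score computation against the shared latent cache. It mirrors the `"hp,hpc->hc"` contraction in the text but is only an illustrative reading of it, not the reference implementation.

```python
import torch

def mqa_style_scores(q_nope, q_rope, w_uk_tilde, c_kv, k_rope):
    """Steps 1-2 sketched for all h heads of the last token at once.

    Assumed (illustrative) shapes:
      q_nope:     (h, d_h)        NoPE query components
      q_rope:     (h, d_h^R)      RoPE query components
      w_uk_tilde: (h, d_h, d_c)   head-wise W~^UK from Step 1
      c_kv:       (n, d_c)        latent KV cache C^KV
      k_rope:     (n, d_h^R)      shared RoPE key K^RoPE
    """
    d_h, d_h_rope = q_nope.shape[-1], q_rope.shape[-1]
    tau = 1.0 / (d_h + d_h_rope) ** 0.5

    # Step 1: absorb the key up-projection into the query ("hp,hpc->hc").
    q_tilde = torch.einsum("hp,hpc->hc", q_nope, w_uk_tilde)   # (h, d_c)

    # Step 2: every head attends to the same latent keys, exactly as in MQA,
    # where a single shared key serves all query heads.
    scores = tau * (q_tilde @ c_kv.T + q_rope @ k_rope.T)      # (h, n)
    return torch.softmax(scores, dim=-1)
```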