Mar 30, 2026arXiv:2603.28430

IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression

AI Summary

IsoQuant introduces a hardware-aligned blockwise rotation framework for compressing LLM KV caches, leveraging quaternion algebra and the isoclinic decomposition of SO(4). By representing 4D blocks as quaternions and applying closed-form transforms, IsoQuant reduces the computational cost of forward rotations compared to RotorQuant. Experiments on synthetic data show IsoQuant achieves 4.5-4.7x kernel-level speedups over RotorQuant with comparable reconstruction MSE, suggesting potential for faster and more efficient KV cache compression.

Key Contribution

Forget slow rotations: IsoQuant's quaternion-based approach warps RotorQuant in LLM KV cache compression, delivering up to 6x speedups on synthetic data.

Abstract

Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise $3$D Clifford rotors, yet the resulting $3$D partition is poorly aligned with modern hardware and offers limited local mixing. We propose \textbf{IsoQuant}, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. It represents each $4$D block as a quaternion and applies a closed-form transform $T(v)=q_L v \overline{q_R}$. This yields two main variants: \emph{IsoQuant-Full}, which realizes the full $SO(4)$ rotation, and \emph{IsoQuant-Fast}, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight $2$D special case. At $d=128$, IsoQuant-Full reduces forward rotation cost from about $2{,}408$ FMAs in RotorQuant to $1{,}024$, while IsoQuant-Fast further reduces it to $512$. Across $18$ fused CUDA settings with $d \in {128,256,512}$, bit widths ${2,3,4}$, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about $4.5\times$--$4.7\times$ over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above $6\times$. Current validation is limited to the stage-1 quantize--dequantize path on synthetic normalized vectors; end-to-end KV-cache evaluation remains future work.

Architecture Design (Transformers, SSMs, MoE)Distributed Systems & Hardware Inference & Quantization

Citation Metrics

Citations0

Influential citations0

References15

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression

Related Papers