Jun 4, 2026

arXiv:2606.06034

When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

Luoming Zhang, Yuwei Ren, Kui Zhang, Tian Liu, Lingjuan Ge, Denghao Li, Matthew Harper Langston, Yin Huang, Weiliang Will Zeng, Liang Zhang

AI Summary

This paper introduces a fast, matrix multiplication (MatMul)-based algorithm for approximating matrix inversion in chunk-wise parallel linear attention, specifically targeting the strictly lower-triangular matrices that arise there. By using a truncated Neumann expansion with structural masking and parallel residual correction, the authors remove the sequential dependencies of forward substitution while preserving accuracy. Experiments with Qwen3.5-family models show up to a 5× kernel-level speedup and a 20% reduction in decode-layer overhead, under both floating-point and low-precision inference, for long-context modeling on NPUs.

Key Contribution

A MatMul-only approximation of triangular matrix inversion for chunk-wise linear attention that reaches up to a 5× kernel-level speedup on NPUs while preserving model accuracy.

Abstract

Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bit INT precision by mitigating the dynamic range expansion arising from repeated matrix power operations, and adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model's accuracy. Experiments on Qwen3.5-family models demonstrate up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.
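The idea of replacing forward substitution with matrix products can be illustrated with a small sketch. It is not the paper's kernel: it assumes the matrix to invert has the form I + A with A strictly lower-triangular, approximates the inverse with a truncated Neumann series, and uses one Newton-Schulz step as a generic stand-in for a residual correction. The helper names (`neumann_inverse`, `newton_schulz_step`, `max_abs_error`), the test matrix, and the chosen order are all hypothetical, and structural masking and quantization are left out. Plain Python lists are used so the example runs without dependencies.

```python
# Illustrative sketch only: a generic multiplication-only inverse of I + A,
# where A is strictly lower-triangular. Not the paper's algorithm or kernel.

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, c):
    return [[c * x for x in row] for row in X]

def matmul(X, Y):
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]

def neumann_inverse(A, order):
    """Approximate (I + A)^{-1} by sum_{k=0}^{order} (-A)^k."""
    n = len(A)
    neg_A = scale(A, -1.0)
    term = identity(n)    # current power (-A)^k, starting at k = 0
    total = identity(n)   # running sum of the series
    for _ in range(order):
        term = matmul(term, neg_A)
        total = add(total, term)
    return total

def newton_schulz_step(M, X):
    """One MatMul-only refinement: X <- X (2I - M X)."""
    n = len(M)
    correction = add(scale(identity(n), 2.0), scale(matmul(M, X), -1.0))
    return matmul(X, correction)

def max_abs_error(M, X):
    """Largest entry of |M X - I|, a simple measure of inverse quality."""
    n = len(M)
    P, I = matmul(M, X), identity(n)
    return max(abs(P[i][j] - I[i][j]) for i in range(n) for j in range(n))

# Hypothetical 8x8 test matrix, strictly lower-triangular.
n = 8
A = [[0.3 / (1 + i - j) if j < i else 0.0 for j in range(n)] for i in range(n)]
M = add(identity(n), A)

X0 = neumann_inverse(A, order=3)   # truncated series: residual is A^4
X1 = newton_schulz_step(M, X0)     # residual becomes A^8, which is zero for n = 8
err0 = max_abs_error(M, X0)
err1 = max_abs_error(M, X1)
print("error after truncation:", err0)
print("error after one correction step:", err1)
```

Because a strictly lower-triangular n×n matrix satisfies A^n = 0, the full series with n − 1 terms is exact, and each correction step squares the remaining residual. In this toy setting, an order-3 truncation plus one correction step therefore already recovers the exact inverse of an 8×8 block up to rounding, using only matrix products and additions.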

Architecture Design (Transformers, SSMs, MoE)

Distributed Systems & Hardware

Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
