This paper introduces a fast, matrix-multiplication (MatMul)-based algorithm for approximating matrix inversion in chunk-wise parallel linear attention, specifically targeting strictly lower-triangular matrices. By combining a truncated Neumann expansion with structural masking and parallel residual correction, the authors remove the sequential dependencies of forward substitution while preserving accuracy. Experiments with Qwen3.5-family models show up to a 5× kernel-level speedup and a 20% reduction in decode-layer overhead, demonstrating the method's effectiveness for long-context modeling on NPUs.
Achieving a 5× speedup in kernel-level operations while maintaining accuracy could revolutionize long-context modeling efficiency on NPUs.
Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for the strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bit integer (INT) arithmetic by mitigating the dynamic-range expansion that arises from repeated matrix-power operations, and we adapt the approximation order and the residual step to the chunk size to minimize computational cost while preserving the model's accuracy. Experiments on Qwen3.5-family models demonstrate up to 5$\times$ kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.
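The core idea can be illustrated with a small sketch. For a strictly lower-triangular $A$, the inverse of $I + A$ is the Neumann series $\sum_k (-A)^k$, which needs only matrix products and can be truncated at a chosen order. The code below is a minimal pure-Python illustration under stated assumptions, not the paper's kernel: the function names, the test matrix, and the Newton-Schulz-style update $X \leftarrow X + X\,(I - (I + A)X)$ used for the residual step are all hypothetical stand-ins, since the abstract does not specify the exact correction scheme. Structural masking and low-bit arithmetic are not modeled.

```python
# Illustrative sketch only: names and the refinement step are assumptions,
# not the paper's implementation.

def matmul(a, b):
    """Plain dense product of two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(a, b, sign=1.0):
    """Return a + sign * b elementwise."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def neumann_inverse(a, order):
    """Approximate (I + A)^-1 by sum_{k=0}^{order} (-A)^k."""
    n = len(a)
    neg_a = [[-x for x in row] for row in a]
    result = identity(n)
    power = identity(n)
    for _ in range(order):
        power = matmul(power, neg_a)  # next power of (-A)
        result = add(result, power)
    return result

def residual(a, x):
    """R = I - (I + A) X, which is zero when X is the exact inverse."""
    n = len(a)
    m = add(identity(n), a)
    return add(identity(n), matmul(m, x), sign=-1.0)

def residual_step(a, x):
    """One refinement X <- X + X R (assumed Newton-Schulz-style correction)."""
    return add(x, matmul(x, residual(a, x)))

def max_abs(m):
    return max(abs(v) for row in m for v in row)

def example_matrix(n):
    """Hypothetical strictly lower-triangular test matrix."""
    return [[0.5 / (1 + i - j) if j < i else 0.0 for j in range(n)]
            for i in range(n)]

a = example_matrix(8)
x3 = neumann_inverse(a, 3)
print("order 3 residual:", max_abs(residual(a, x3)))
print("order 3 + one step:", max_abs(residual(a, residual_step(a, x3))))
print("order 7 residual:", max_abs(residual(a, neumann_inverse(a, 7))))
```

Because a strictly lower-triangular $C \times C$ matrix satisfies $A^C = 0$, the series is exact at order $C - 1$. In this sketch, one refinement step turns an order-$K$ truncation into an order-$(2K + 1)$ one, which suggests why the truncation order and the number of residual steps can be traded against each other as the chunk size changes.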