Search papers, labs, and topics across Lattice.
This paper introduces LaplacianFormer, a Transformer variant using a Laplacian kernel as a theoretically grounded alternative to softmax attention, addressing the limitations of Gaussian kernel approximations. To mitigate expressiveness loss from low-rank approximations, they propose an injective feature map that preserves token information. By combining a Nystr枚m approximation with Newton-Schulz iteration and custom CUDA implementations, LaplacianFormer achieves competitive performance-efficiency trade-offs on ImageNet.
Laplacian kernels offer a theoretically sound and empirically superior alternative to Gaussian kernels for linear attention, finally bridging the gap between efficient computation and expressive power in Transformers.
The quadratic complexity of softmax attention presents a major obstacle for scaling Transformers to high-resolution vision tasks. Existing linear attention variants often replace the softmax with Gaussian kernels to reduce complexity, but such approximations lack theoretical grounding and tend to oversuppress mid-range token interactions. We propose LaplacianFormer, a Transformer variant that employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis. To address expressiveness degradation under low-rank approximations, we introduce a provably injective feature map that retains fine-grained token information. For efficient computation, we adopt a Nystr\"om approximation of the kernel matrix and solve the resulting system using Newton--Schulz iteration, avoiding costly matrix inversion and SVD. We further develop custom CUDA implementations for both the kernel and solver, enabling high-throughput forward and backward passes suitable for edge deployment. Experiments on ImageNet show that LaplacianFormer achieves strong performance-efficiency trade-offs while improving attention expressiveness.