The paper challenges the necessity of explicit attention mechanisms for global sequence modeling in Transformers. It reframes attention as a dynamically parameterized MLP, suggesting that global modeling arises from implicit context compression rather than explicit token-wise aggregation. The authors then introduce dynamic parameter prediction strategies within standard network layers, achieving Transformer-level performance in vision tasks with linear complexity, effectively replacing explicit attention.
Attention might just be a cleverly disguised MLP: this work shows you can ditch the quadratic complexity and still get Transformer-level performance by dynamically predicting parameters in standard network layers.
Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention's global modeling power not as explicit token-wise aggregation, but as an implicit process in which dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can we achieve Transformer-level global sequence modeling entirely through dynamic parameterization while maintaining linear complexity, effectively replacing explicit attention? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models demonstrate that dynamic parameterization can indeed serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling. Code is available at https://github.com/LeapLabTHU/WeightFormer.
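The reframing described in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the WeightFormer repository; every function name below is hypothetical. The first two functions show the standard identity that single-head softmax attention equals a two-layer MLP whose weights are the input-dependent key and value matrices, with a row-wise softmax as the activation. The third function, `pooled_dynamic_linear`, is only an assumed example of what a linear-complexity dynamic parameterization could look like (a pooled context vector generating the weights of a shared linear layer); the authors' actual prediction strategies may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    # Standard single-head scaled dot-product attention.
    # Q, K: (N, d); V: (N, d_v). Cost is quadratic in N.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V


def attention_as_mlp(Q, K, V):
    # The same computation written as a two-layer MLP applied to each query.
    # The "weights" W1 and W2 are not learned constants: they are produced
    # from the input sequence itself, i.e. dynamically predicted.
    d = Q.shape[-1]
    W1 = K.T / np.sqrt(d)      # (d, N) first-layer weights, hidden width N
    W2 = V                     # (N, d_v) second-layer weights
    hidden = softmax(Q @ W1)   # row-wise softmax plays the role of activation
    return hidden @ W2


def pooled_dynamic_linear(X, W_gen):
    # Hypothetical linear-complexity variant (an assumption, not the paper's
    # method): compress the whole sequence into one context vector, use it to
    # generate a (d, d) weight matrix, then apply that matrix to every token.
    # X: (N, d); W_gen: (d, d * d). Cost is linear in N.
    N, d = X.shape
    context = X.mean(axis=0)                   # (d,) global context summary
    W_dyn = (context @ W_gen).reshape(d, d)    # dynamically predicted weights
    return X @ W_dyn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d = 6, 4
    X = rng.standard_normal((N, d))
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    same = np.allclose(attention(Q, K, V), attention_as_mlp(Q, K, V))
    print("attention equals dynamic MLP:", same)

    W_gen = rng.standard_normal((d, d * d)) * 0.1
    print("pooled variant output shape:", pooled_dynamic_linear(X, W_gen).shape)
```

In the MLP view the hidden width equals the sequence length N, which is where the quadratic cost comes from. The pooled variant keeps the generated weight matrix at a fixed size regardless of N, which is one way the linear-complexity goal stated in the abstract could be met.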