The paper introduces the Polynomial Mixer (PoM), a linear-complexity token mixing mechanism that replaces self-attention by aggregating tokens into a compact representation via a learned polynomial function. PoM provably maintains the universal sequence-to-sequence approximation property of transformers. Experiments across five domains show that PoM matches attention performance while significantly reducing computational cost on long sequences.
The Polynomial Mixer (PoM) targets attention's quadratic scaling: it matches attention performance at linear complexity across five diverse domains.
This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences. The code is available at https://github.com/davidpicard/pom.
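The aggregate-then-retrieve idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); it only shows how a mixer of this kind can run in time linear in the number of tokens. The function name `polynomial_mixer`, the parameter names `w_poly`, `w_select`, and `w_out`, the `tanh` activation, the mean aggregation, and the sigmoid gating are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def polynomial_mixer(x, w_poly, w_select, w_out):
    """Hypothetical linear-complexity token mixer (illustrative sketch only).

    x        : (n, d)      input tokens
    w_poly   : (K, d, h)   one projection per polynomial degree 1..K (assumed)
    w_select : (d, K * h)  projection producing per-token gates (assumed)
    w_out    : (K * h, d)  output projection (assumed)
    """
    degree = w_poly.shape[0]

    # Step 1: per-token polynomial features. Degree k uses its own projection,
    # a bounded activation (tanh, an assumption), and an elementwise power.
    feats = [np.tanh(x @ w_poly[k]) ** (k + 1) for k in range(degree)]

    # Step 2: aggregate all tokens into one compact state of size K * h.
    # A single pass over the n tokens, so the cost is linear in n.
    state = np.concatenate(feats, axis=1).mean(axis=0)  # shape (K * h,)

    # Step 3: each token retrieves context by gating the shared state
    # with its own selection vector, then projecting back to d dimensions.
    gate = sigmoid(x @ w_select)  # shape (n, K * h)
    return (gate * state) @ w_out  # shape (n, d)


def init_params(d, h, degree, seed=0):
    """Random parameters for the sketch (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    w_poly = rng.normal(scale=d ** -0.5, size=(degree, d, h))
    w_select = rng.normal(scale=d ** -0.5, size=(d, degree * h))
    w_out = rng.normal(scale=(degree * h) ** -0.5, size=(degree * h, d))
    return w_poly, w_select, w_out
```

In this sketch no n-by-n matrix is ever formed: the only object shared across tokens is the fixed-size `state` vector, which is what keeps the cost linear in sequence length. Like unmasked self-attention, this non-causal variant is permutation equivariant, so any order information would have to come from positional encodings.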