Microsoft Research · Cambridge · Cornell · Jun 9, 2026 · arXiv:2606.10944

Express Language Modeling

Albert Gong, Annabelle Michael Carrell, Raaz Dwivedi, Lester Mackey

AI Summary

This paper introduces Express, a tool that converts a non-causal attention approximation into a causal one with matching approximation guarantees. Combined with the Thinformer approximation, Express improves on the best known causal attention guarantees, delivering $\log^{3/2}(n)/s$ approximation error with $O(s)$ memory and $O(s^2 \log^2(n))$ compression overhead for a sequence of length $n$. An I/O-aware Triton implementation shows substantial speedups over FlashAttention 2, and the authors apply Express to four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.

Key Contribution

Express converts non-causal attention approximations into causal ones with matching guarantees; paired with Thinformer, it improves on the best known causal attention guarantees and enables more efficient long-context language modeling.

Abstract

We introduce a new tool, Express, for converting a non-causal attention approximation into a causal approximation with matching approximation guarantees. When combined with the state-of-the-art Thinformer approximation, Express improves upon the best known causal attention guarantees, delivering $\log^{3/2}(n)/s$ approximation error with only $O(s)$ memory and $O(s^2 \log^2(n))$ compression overhead for a sequence of length $n$. We pair these developments with an efficient I/O-aware Triton implementation, demonstrate substantial speedups over FlashAttention 2, and use Express to overcome four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.
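To make the causal versus non-causal distinction concrete, the following is a minimal sketch of exact softmax attention in both modes. It is not the Express algorithm, the Thinformer approximation, or the Triton implementation; it only shows the quantity that a causal approximation targets. The function names `attention` and `max_abs_error` are hypothetical, and the maximum absolute difference is used here as a generic way to compare two outputs, not necessarily the error measure used in the paper.

```python
import math
import random


def attention(Q, K, V, causal):
    """Exact softmax attention over lists of vectors.

    With causal=True, query i attends only to keys 0..i;
    with causal=False, every query attends to all keys.
    """
    d = len(Q[0])
    out = []
    for i in range(len(Q)):
        last = i + 1 if causal else len(K)
        # Scaled dot-product scores against the visible keys.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
            for j in range(last)
        ]
        # Subtract the max score for numerical stability before exponentiating.
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append(
            [sum(w[j] * V[j][c] for j in range(last)) / z for c in range(len(V[0]))]
        )
    return out


def max_abs_error(A, B):
    """Largest entrywise absolute difference between two outputs (illustrative metric)."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


random.seed(0)
n, d = 8, 4  # toy sequence length and head dimension (assumed values)
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

causal_out = attention(Q, K, V, causal=True)
full_out = attention(Q, K, V, causal=False)
gap = max_abs_error(causal_out, full_out)
```

In this sketch the first causal output row equals the first value vector, since the first query sees only one key, and the last causal row matches the last non-causal row, since the final query sees every key. The rows in between generally differ, which is why a non-causal approximation cannot be reused for causal attention without a conversion step.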

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
