Transformers waste up to 56% of their MLP compute on layers that behave near-linearly, and selectively replacing those layers with purely linear ones can actually *improve* performance.
Certain transformer attention heads act like surprisingly robust Bloom filters: they remember which tokens appeared earlier in the context with high accuracy, and the behavior generalizes beyond just repeated names.