Channel-wise adaptive learning rates in Gated Delta Networks unlock superior long-context recall, rivaling softmax attention without the quadratic cost.
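As a rough illustration of the idea, here is a minimal NumPy sketch of a gated delta-rule memory update in which the usual scalar write strength is replaced by a per-channel vector of learning rates. The names (`gated_delta_step`, `beta_vec`) and the exact placement of the gate are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta_vec):
    """One gated delta-rule update of an associative memory S (d_k x d_v).

    Sketch under assumptions: alpha is a scalar forget gate, and the
    "channel-wise" idea is modeled by beta_vec, a per-value-channel
    vector of learning rates in place of a single scalar beta.
    """
    k = k / (np.linalg.norm(k) + 1e-8)      # unit-norm key
    S = alpha * S                           # gated decay of old associations
    delta = v - S.T @ k                     # prediction error for this key
    S = S + np.outer(k, beta_vec * delta)   # per-channel corrective write
    return S

# Toy usage: each value channel is written with its own learning rate.
rng = np.random.default_rng(0)
d_k = d_v = 8
S = np.zeros((d_k, d_v))
for _ in range(16):
    k, v = rng.normal(size=d_k), rng.normal(size=d_v)
    beta_vec = rng.uniform(0.1, 1.0, size=d_v)  # would be input-dependent in practice
    S = gated_delta_step(S, k, v, alpha=0.95, beta_vec=beta_vec)
readout = S.T @ rng.normal(size=d_k)            # query the memory with a new key
```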
Training long-context sparse attention models doesn't have to be a slow, imbalanced mess: SparseBalance achieves a 1.33x speedup while *improving* accuracy.
Attention Sink, where Transformers fixate on seemingly irrelevant tokens, is more than a quirk: it's a fundamental challenge that affects training and inference, can even cause hallucinations, and demands a systematic approach to understanding and mitigation.