Search papers, labs, and topics across Lattice.
2
0
6
39
Forget full KV caches: randomly routing attention across layers during training lets you drastically cut memory without hurting performance, and sometimes even helps.
Stop guessing how much to pretrain vs. specialize your language model – scaling laws can now tell you the optimal compute allocation for maximizing performance on downstream tasks.