Search papers, labs, and topics across Lattice.
Apple
1
0
3
2
Forget full KV caches: randomly routing attention across layers during training lets you drastically cut memory without hurting performance, and sometimes even helps.