Training massive LLMs on a single GPU is now possible, potentially democratizing access to large-scale model development.
Forget auxiliary losses and fixed expert capacity: Expert Threshold routing dynamically allocates computation in MoEs and balances expert load, all while boosting data efficiency by 1.6x.
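The blurb above does not spell out how threshold routing works, but the general idea behind threshold-based MoE gating can be sketched as follows: instead of sending every token to a fixed top-k of experts, each token is routed to every expert whose gate probability clears a threshold, so "hard" tokens get more compute and "easy" tokens get less. The function and parameter names below (`threshold_route`, `tau`) are hypothetical, and the fallback-to-top-1 rule is an assumption, not a detail from the source.

```python
import numpy as np

def threshold_route(gate_logits, tau=0.2):
    # Hypothetical sketch of threshold routing: route each token to
    # every expert whose softmax gate probability exceeds tau.
    # Assumption: if no expert clears tau, fall back to the top-1
    # expert so every token is processed at least once.
    z = gate_logits - gate_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    routes = []
    for p in probs:
        chosen = np.where(p > tau)[0]
        if chosen.size == 0:  # fallback: top-1 expert
            chosen = np.array([int(p.argmax())])
        routes.append(chosen)
    return probs, routes

# Two tokens, four experts: the first token's gate mass is
# concentrated on one expert, the second is nearly uniform.
logits = np.array([[4.0, 0.0, 0.0, 0.0],
                   [0.1, 0.0, 0.1, 0.0]])
probs, routes = threshold_route(logits, tau=0.2)
print([r.tolist() for r in routes])  # → [[0], [0, 1, 2, 3]]
```

Note how the confident token uses a single expert while the ambiguous token fans out to all four; this per-token variability is what a fixed top-k (or fixed expert capacity) cannot express.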