Search papers, labs, and topics across Lattice.
3
0
4
Recurrent memory can be added to transformers at scale with minimal parameter overhead and no performance penalty by reusing existing hidden states and training with interleaved parallel updates.
LLMs can be made far more robust to the position of information in long contexts by simply shuffling the context during fine-tuning.
Forget painstaking hyperparameter tuning: this hypersphere parameterization lets you transfer a single learning rate across model sizes, depths, and even MoE architectures, slashing compute costs by 1.58x.