Distilling language models just got more efficient: a new loss function focuses on the long tail of token probabilities, boosting performance without extra compute.