Ditching the strict unit-sum constraint in softmax attention with a simple affine scaling trick unlocks more stable training and better downstream performance for Transformers.
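The teaser does not spell out the mechanism, but the idea can be sketched as follows: compute ordinary softmax attention weights, then apply a learned affine transform so each row is no longer forced to sum to exactly 1. The `scale` and `shift` parameters below (shown as fixed scalars rather than learned values) and the overall formulation are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def affine_scaled_attention(q, k, v, scale=1.5, shift=0.1):
    # Standard scaled dot-product scores.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Plain softmax weights sum to exactly 1 per query row.
    w = softmax(scores, axis=-1)
    # Affine rescaling relaxes the unit-sum constraint:
    # each row now sums to scale + shift * num_keys instead of 1.
    w = scale * w + shift
    return w @ v, w

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))   # 4 queries, head dim 8
k = rng.standard_normal((6, 8))   # 6 keys
v = rng.standard_normal((6, 8))   # 6 values
out, w = affine_scaled_attention(q, k, v)
# Row sums are 1.5 + 0.1 * 6 = 2.1, not 1.
```

In a real Transformer, `scale` and `shift` would be trainable (e.g. per-head) parameters rather than constants, so the model can recover standard softmax attention as a special case.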