Pre-normalization in Transformers is the root cause of the puzzling link between massive activation outliers and attention sinks; decoupling the two reveals that they serve distinct functions: global parameterization versus local attention modulation.
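For context, "pre-normalization" (Pre-LN) places the LayerNorm before each sublayer, so the residual stream itself is never renormalized; this is the structural detail the claim turns on, since values a sublayer writes into the stream can persist and grow across layers unchecked. Below is a minimal PyTorch sketch of a Pre-LN block; the module names and sizes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Illustrative Pre-LN Transformer block (not the paper's code).

    Each sublayer normalizes its *input*, but the residual stream is
    added back without renormalization, which is why large activations
    can accumulate in it. (Post-LN, by contrast, applies LayerNorm to
    the residual sum itself after each sublayer.)
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize only the sublayer input; the residual path bypasses the norm.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))  # residual stream is never renormalized
        return x
```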