Exclusive Self Attention (XSA) modifies standard self-attention by projecting each token's attention output to be orthogonal to its own value vector, effectively removing self-positional information. This encourages the model to rely more on context. Experiments on language modeling show that XSA outperforms standard self-attention at model sizes up to 2.7B parameters, with larger gains observed for longer sequence lengths.
Transformers get a surprising boost in language modeling performance by simply ignoring "themselves" during attention.
We introduce exclusive self-attention (XSA), a simple modification of self-attention (SA) that improves the Transformer's sequence modeling performance. The key idea is to constrain attention to capture only information orthogonal to the token's own value vector (thus excluding self-positional information), encouraging better context modeling. Evaluated on the standard language modeling task, XSA consistently outperforms SA across model sizes up to 2.7B parameters and shows increasingly larger gains as sequence length grows.
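One plausible reading of the mechanism described above is a post-hoc projection: compute standard self-attention, then remove from each token's output the component lying along that token's own value vector. The sketch below illustrates this interpretation in NumPy; the function name and the exact placement of the projection are assumptions, not the authors' reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def exclusive_self_attention(Q, K, V):
    """Hypothetical XSA sketch: standard self-attention whose output is
    projected to be orthogonal to each token's own value vector."""
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))   # (n, n) attention weights
    out = attn @ V                         # standard SA output
    # Subtract the component of each output along the token's own value
    # vector, leaving only information orthogonal to it.
    coef = (out * V).sum(-1, keepdims=True) / (V * V).sum(-1, keepdims=True)
    return out - coef * V

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
y = exclusive_self_attention(Q, K, V)
# Each output row is (numerically) orthogonal to its value vector.
print(np.abs((y * V).sum(-1)).max() < 1e-8)
```

Under this reading, the projection guarantees the orthogonality constraint exactly at the output, while leaving the attention weights themselves unchanged.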