Exclusive Self Attention (XSA) modifies standard self-attention by projecting each token's attention output to be orthogonal to its own value vector, effectively removing self-positional information. This encourages the model to rely more on context. Experiments on language modeling show that XSA outperforms standard self-attention at model sizes up to 2.7B parameters, with larger gains observed for longer sequence lengths.
Transformers get a surprising boost in language modeling performance by simply ignoring "themselves" during attention.
We introduce exclusive self-attention (XSA), a simple modification of self-attention (SA) that improves the Transformer's sequence modeling performance. The key idea is to constrain attention to capture only information orthogonal to the token's own value vector (thus excluding self-positional information), encouraging better context modeling. Evaluated on the standard language modeling task, XSA consistently outperforms SA across model sizes up to 2.7B parameters and shows increasingly larger gains as sequence length grows.
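One plausible reading of the mechanism described above is a post-hoc projection: compute standard self-attention, then remove from each token's output the component lying along that token's own value vector. The sketch below illustrates this interpretation in NumPy; the function name and the exact placement of the projection are assumptions, not the authors' reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def exclusive_self_attention(Q, K, V):
    """Hypothetical XSA sketch: standard self-attention whose output is
    projected to be orthogonal to each token's own value vector."""
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))   # (n, n) attention weights
    out = attn @ V                         # standard SA output
    # Subtract the component of each output along the token's own value
    # vector, leaving only information orthogonal to it.
    coef = (out * V).sum(-1, keepdims=True) / (V * V).sum(-1, keepdims=True)
    return out - coef * V

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
y = exclusive_self_attention(Q, K, V)
# Each output row is (numerically) orthogonal to its value vector.
print(np.abs((y * V).sum(-1)).max() < 1e-8)
```

Under this reading, the projection guarantees the orthogonality constraint exactly at the output, while leaving the attention weights themselves unchanged.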