By initializing sparse attention with on-chip histograms, AdaSplash-2 achieves training speed comparable to or better than FlashAttention-2 at moderate-to-high sparsity, unlocking the potential of $\alpha$-entmax for long-context transformers.
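For context on what $\alpha$-entmax computes, here is a minimal, illustrative sketch (not AdaSplash-2's fused kernel): a plain PyTorch bisection solver for the entmax threshold $\tau$ in $\mathrm{entmax}_\alpha(z) = [(\alpha-1)z - \tau\mathbf{1}]_+^{1/(\alpha-1)}$. The function name `alpha_entmax` and parameters such as `n_iter` are hypothetical names for this example.

```python
import torch

def alpha_entmax(scores: torch.Tensor, alpha: float = 1.5,
                 n_iter: int = 50, dim: int = -1) -> torch.Tensor:
    """Illustrative alpha-entmax via bisection on the threshold tau.

    Unlike softmax, low-scoring entries receive exactly zero probability,
    which is what makes alpha-entmax attention sparse.
    """
    assert alpha > 1.0, "alpha must exceed 1 (alpha -> 1 recovers softmax)"
    x = (alpha - 1.0) * scores
    max_x = x.max(dim=dim, keepdim=True).values
    # f(tau) = sum_i relu(x_i - tau)^(1/(alpha-1)) - 1 is decreasing in tau,
    # nonnegative at tau = max_x - 1 and negative at tau = max_x,
    # so the root lies in [max_x - 1, max_x].
    tau_lo, tau_hi = max_x - 1.0, max_x
    for _ in range(n_iter):
        tau = 0.5 * (tau_lo + tau_hi)
        p = torch.clamp(x - tau, min=0.0) ** (1.0 / (alpha - 1.0))
        mass = p.sum(dim=dim, keepdim=True)
        # If the candidate distribution has too much mass, tau is too small.
        tau_lo = torch.where(mass >= 1.0, tau, tau_lo)
        tau_hi = torch.where(mass < 1.0, tau, tau_hi)
    tau = 0.5 * (tau_lo + tau_hi)
    p = torch.clamp(x - tau, min=0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum(dim=dim, keepdim=True)

# Example: attention mass concentrates on a few keys; the rest are exactly zero.
scores = torch.tensor([[2.0, 1.0, 0.2, -1.0]])
print(alpha_entmax(scores, alpha=1.5))
```

Because most entries of the resulting distribution are exactly zero, an attention kernel can skip the corresponding key/value blocks entirely, which is the sparsity that the speed comparison against FlashAttention-2 exploits.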