Search papers, labs, and topics across Lattice.
2
0
3
Transformers may succeed at time series forecasting without relying on the complex superposition that drives their power in NLP, challenging the assumption that these models are leveraging rich compositional representations.
By constraining Transformer architectures to have bounded representations and uniform attention, grokking can be bypassed entirely for modular addition, suggesting task-specific geometric alignment is key.