The paper introduces the Latent Recurrent Transformer (LRT), a method that augments autoregressive transformers by reusing a high-level source-layer hidden state from the previous token as recurrent memory for the next token. To pretrain this recurrence at scale, the authors propose interleaved parallel training, which refines disjoint position subsets in parallel after a single full-sequence initialization pass. Results show that LRT improves language-modeling loss and in-context learning under matched effective compute while adding minimal parameters.
Recurrent memory can be added to transformers at scale with minimal parameter overhead, and with gains under matched effective compute, by reusing existing hidden states and training with interleaved parallel updates.
We study the Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source state is already computed during ordinary decoding, LRT adds a cross-layer recurrent latent pathway across positions without inserting pause tokens or extra depth loops, and the standard attention mechanism and KV-cache interface are preserved. To pretrain this recurrence at scale without sequentially unrolling the transformer, we introduce interleaved parallel training: a single full-sequence initialization forward pass builds a shared buffer; then disjoint position subsets are refined in parallel and written back, so that all tokens receive recurrent-memory-aware supervision at roughly twice the baseline compute. Across nanochat-style backbones and a wide range of tokens-per-parameter budgets, LRT improves both language-modeling loss and in-context learning under matched effective compute while adding as little as 0.3% additional parameters.
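The interleaved parallel training schedule described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the per-position `toy_source_state` function stands in for a full transformer forward pass (attention is omitted), the names `w_in`, `w_mem`, and `num_subsets` are hypothetical, and the choice to process subsets one after another, with each subset reading the buffer written by earlier subsets, is one possible reading of the abstract rather than a confirmed detail.

```python
import numpy as np

def toy_source_state(x, memory, w_in, w_mem):
    """Toy stand-in for the source-layer hidden state (assumption: no attention).

    x:      (n, d) token embeddings for the positions being computed
    memory: (n, d) recurrent memory injected at those positions
    """
    return np.tanh(x @ w_in + memory @ w_mem)

def shift_right(h):
    """Memory for position t is the source state of position t-1 (zeros at t=0)."""
    out = np.zeros_like(h)
    out[1:] = h[:-1]
    return out

def sequential_pass(x, w_in, w_mem):
    """Reference: fully unrolled recurrence, one position at a time."""
    h = np.zeros_like(x)
    prev = np.zeros(x.shape[1])
    for t in range(x.shape[0]):
        h[t] = toy_source_state(x[t:t + 1], prev[None, :], w_in, w_mem)[0]
        prev = h[t]
    return h

def interleaved_parallel_pass(x, w_in, w_mem, num_subsets=2):
    """Initialization pass, then refinement of disjoint position subsets."""
    T = x.shape[0]
    # 1) Single full-sequence initialization pass with no recurrent memory.
    buffer = toy_source_state(x, np.zeros_like(x), w_in, w_mem)
    # 2) Refine each disjoint subset (positions k, k + num_subsets, ...).
    #    All positions inside a subset are computed in one vectorized call.
    for k in range(num_subsets):
        idx = np.arange(k, T, num_subsets)
        memory = shift_right(buffer)  # previous-position states from the buffer
        buffer[idx] = toy_source_state(x[idx], memory[idx], w_in, w_mem)
    return buffer

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
w_in = 0.5 * rng.normal(size=(4, 4))
w_mem = 0.5 * rng.normal(size=(4, 4))
approx = interleaved_parallel_pass(x, w_in, w_mem)
exact = sequential_pass(x, w_in, w_mem)
print("max abs gap vs. sequential unroll:", np.abs(approx - exact).max())
```

In this toy, every position is evaluated once during initialization and once during refinement, which loosely mirrors the roughly twofold compute mentioned in the abstract. The refined states only approximate the fully unrolled recurrence, because most positions read a neighbor's state that was itself computed from an approximate memory.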