UW, CUHK, HKUST, UW-Madison | May 26, 2026 | arXiv:2605.26797

Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior

Zeyi Huang, Xuehai He, LiLiang Ren, Yiping Wang, Baolin Peng, Hao Cheng, Shuohang Wang, Pengcheng He, Jianfeng Gao, Yelong Shen

AI Summary

The paper introduces the Latent Recurrent Transformer (LRT), a method that augments autoregressive transformers by reusing a high-level source-layer hidden state from the previous token as recurrent memory for the next token. To pretrain this recurrence at scale, the authors introduce interleaved parallel training, which refines disjoint position subsets in parallel after a single full-sequence initialization pass. Results show that LRT improves language-modeling loss and in-context learning under matched effective compute while adding minimal parameters.
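The recurrence described in the summary can be pictured with a toy decoding loop. This is a minimal sketch under stated assumptions, not the paper's implementation: each "layer" is a scalar nonlinearity standing in for a transformer block, attention is omitted entirely, and the names `toy_layer`, `decode_step`, `SOURCE_LAYER`, and `MIX`, as well as the choice to add the memory at the input of the stack, are hypothetical.

```python
import math

NUM_LAYERS = 4     # hypothetical toy depth
SOURCE_LAYER = 3   # hypothetical: layer whose output is reused as memory (1-indexed)
MIX = 0.5          # hypothetical mixing weight for the recurrent memory


def toy_layer(h, layer_idx):
    """Stand-in for a transformer block: a fixed scalar nonlinearity."""
    return math.tanh(h + 0.1 * layer_idx)


def decode_step(x_t, memory):
    """Process one token; return (final hidden state, source-layer state)."""
    h = x_t + MIX * memory  # assumed injection point: add memory at the input
    source_state = None
    for layer_idx in range(1, NUM_LAYERS + 1):
        h = toy_layer(h, layer_idx)
        if layer_idx == SOURCE_LAYER:
            source_state = h  # computed anyway during the ordinary forward pass
    return h, source_state


def decode(inputs):
    """Sequential decoding: each token sees the previous token's source state."""
    memory = 0.0  # no memory exists before the first token
    outputs = []
    for x_t in inputs:
        h, memory = decode_step(x_t, memory)
        outputs.append(h)
    return outputs
```

The point of the sketch is that the memory costs no extra forward computation at decode time: the source-layer state is a by-product of processing the previous token, and it is simply carried forward one position.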

Key Contribution

Recurrent memory can be added to transformers at scale with as little as 0.3% additional parameters, improving loss under matched effective compute, by reusing existing hidden states and training with interleaved parallel updates.

Abstract

We study the Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source state is already computed during ordinary decoding, LRT adds a cross-layer recurrent latent pathway across positions without inserting pause tokens or extra depth loops, and the standard attention mechanism and KV-cache interface are preserved. To pretrain this recurrence at scale without sequentially unrolling the transformer, we introduce interleaved parallel training: a single full-sequence initialization forward pass builds a shared buffer; then disjoint position subsets are refined in parallel and written back, so that all tokens receive recurrent-memory-aware supervision at roughly 2 times baseline compute. Across nanochat-style backbones and a wide range of tokens-per-parameter budgets, LRT improves both language-modeling loss and in-context learning under matched effective compute while adding as little as 0.3% parameters.
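The training schedule in the abstract can be illustrated with a toy that tracks only the bookkeeping. This is a sketch under stated assumptions rather than the authors' procedure: `toy_forward` is a hypothetical scalar stand-in for a per-position forward pass, attention is omitted, the initialization pass is assumed to run with zero memory, and the strided split into subsets is one possible way to make the subsets disjoint.

```python
def toy_forward(x_t, memory):
    """Hypothetical per-position forward: returns the source-layer state."""
    return 0.5 * x_t + 0.25 * memory


def interleaved_subsets(seq_len, num_subsets):
    """Split positions 0..seq_len-1 into disjoint strided subsets (assumed scheme)."""
    return [list(range(k, seq_len, num_subsets)) for k in range(num_subsets)]


def interleaved_pass(inputs, num_subsets=2):
    """Return (refined buffer, number of forward evaluations per position)."""
    seq_len = len(inputs)
    evals = [0] * seq_len

    # Initialization: one full-sequence pass with no recurrent memory
    # fills the shared buffer of source-layer states.
    buffer = []
    for t in range(seq_len):
        buffer.append(toy_forward(inputs[t], 0.0))
        evals[t] += 1

    # Refinement: positions inside a subset only read the buffer, never each
    # other, so each subset could be computed in parallel.
    for subset in interleaved_subsets(seq_len, num_subsets):
        refined = {}
        for t in subset:
            memory = buffer[t - 1] if t > 0 else 0.0
            refined[t] = toy_forward(inputs[t], memory)
            evals[t] += 1
        for t, state in refined.items():  # write back once the subset is done
            buffer[t] = state
    return buffer, evals
```

In this toy, every position is evaluated exactly twice (once during initialization, once during refinement), which is one way to read the abstract's figure of roughly 2 times baseline compute. With a stride of at least 2, no position shares a subset with its predecessor, so the memory each position needs is always available in the buffer before its subset runs.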

Architecture Design (Transformers, SSMs, MoE) | Scaling Laws & Emergent Abilities | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
