Jun 4, 2026 · arXiv:2606.06479

Pretraining Recurrent Networks without Recurrence

AI Summary

This paper introduces Supervised Memory Training (SMT), a method for training nonlinear recurrent neural networks (RNNs) that avoids recurrent credit propagation by recasting RNN training as supervised learning on one-step memory transition labels. The labels come from a Transformer-based encoder trained on a predictive state objective, which keeps only the past information needed to predict the future. Because each transition is supervised directly, SMT trains in parallel across time, with a stable $O(1)$-length gradient path between any two tokens and no unrolling of the RNN. The authors report that SMT outperforms backpropagation through time (BPTT) when pretraining various RNN architectures on tasks such as language modeling and pixel sequence modeling, and suggest it could help scale models that build temporal abstractions of past experience.
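To make the credit-assignment problem concrete, here is a minimal sketch of how a gradient carried through many recurrent steps can shrink. It assumes a scalar toy RNN, $h_t = \tanh(w \, h_{t-1})$, which is not from the paper; the function name and parameter values are hypothetical and chosen only for illustration.

```python
import math


def gradient_through_time(w, h0, steps):
    """Return d h_T / d h_0 for the toy scalar RNN h_t = tanh(w * h_{t-1}).

    Assumption: this scalar recurrence is an illustrative stand-in,
    not an architecture taken from the paper.
    """
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule for one step: d tanh(w * h_prev) / d h_prev = w * (1 - h^2)
        grad *= w * (1.0 - h * h)
    return grad


# With |w| < 1 every per-step factor is below 1, so the product decays
# roughly geometrically in the number of steps.
short_range = gradient_through_time(w=0.9, h0=0.5, steps=5)
long_range = gradient_through_time(w=0.9, h0=0.5, steps=200)
print(short_range, long_range)
```

In this toy setting the 200-step gradient is many orders of magnitude smaller than the 5-step one, which is the kind of decay a constant-length gradient path is meant to avoid.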

Key Contribution

Supervised Memory Training lets nonlinear RNNs train in parallel across time and learn long-range dependencies more effectively, outperforming BPTT on the reported pretraining tasks.

Abstract

Training recurrent neural networks (RNNs) requires assigning credit across long sequences of computations. Standard backpropagation through time (BPTT) addresses this problem poorly: it is sequential in time, limiting parallelism, and suffers from vanishing or exploding gradients, making long-range associations difficult to learn. We propose Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one-step memory transition labels $(m_t, x_{t+1}) \rightarrow m_{t+1}$. SMT acquires these memory labels by training a Transformer-based encoder on a predictive state objective, retaining only information from the past necessary to predict the future. By decoupling what to remember from how to update memory, SMT enables time-parallel RNN training with a stable $O(1)$ length gradient path between any two tokens, without ever unrolling the RNN. We find that SMT outperforms BPTT when pretraining various RNN architectures on tasks like language modeling and pixel sequence modeling. SMT enables nonlinear RNNs to better capture long-range dependencies and train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.
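The one-step reduction $(m_t, x_{t+1}) \rightarrow m_{t+1}$ can be sketched as follows. This is a toy illustration under loud assumptions, not the authors' implementation: memories are scalars rather than vectors, and the labels come from a hypothetical stand-in (`toy_labels`, an exponential moving average) instead of a Transformer-based encoder. All function names are invented for this example.

```python
def make_transition_pairs(memory_labels, inputs):
    """Build one-step examples ((m_t, x_{t+1}), m_{t+1}).

    memory_labels[0] is the initial memory and memory_labels[t] is the
    target memory after the first t inputs, so there is one more label
    than there are inputs.
    """
    assert len(memory_labels) == len(inputs) + 1
    return [
        ((memory_labels[t], inputs[t]), memory_labels[t + 1])
        for t in range(len(inputs))
    ]


def one_step_losses(update, pairs):
    """Squared error of a single update step on each pair.

    Each loss depends only on its own pair, so the steps could be
    evaluated in any order or in parallel; the RNN is never unrolled.
    """
    return [(update(m, x) - target) ** 2 for (m, x), target in pairs]


def toy_labels(inputs, decay=0.5):
    """Hypothetical stand-in for encoder-produced memory labels."""
    labels = [0.0]
    for x in inputs:
        labels.append(decay * labels[-1] + (1.0 - decay) * x)
    return labels


inputs = [1.0, 0.0, 2.0, 4.0]
labels = toy_labels(inputs)
pairs = make_transition_pairs(labels, inputs)

# An update rule that matches the label generator has zero loss at every step;
# one that ignores the input does not.
matching = one_step_losses(lambda m, x: 0.5 * m + 0.5 * x, pairs)
mismatched = one_step_losses(lambda m, x: m, pairs)
print(matching, mismatched)
```

Because every pair carries its own target, the error signal for step $t$ never has to travel through earlier or later steps, which is the sense in which the recurrent credit path is sidestepped in this sketch.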

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
