Mar 10, 2026 | arXiv:2603.09576

Routing without Forgetting

Alessio Masano, Giovanni Bellitto, Dipam Goswami, Joost van de Weijer, Concetto Spampinato

AI Summary

The paper introduces Routing without Forgetting (RwF), a novel transformer architecture for online continual learning (OCL) that addresses the limitations of parameter-efficient adaptation methods. RwF uses energy-based associative retrieval layers, inspired by Modern Hopfield Networks, to generate dynamic prompts through single-step retrieval over token embeddings, enabling input-conditioned routing without gradient refinement. Experiments on class-incremental benchmarks, including Split-ImageNet-R and Split-ImageNet-S, demonstrate that RwF significantly outperforms existing prompt-based methods, particularly in few-shot settings.

Key Contribution

Instead of relying on iterative gradient refinement, RwF routes transformer token embeddings through Hopfield-inspired associative retrieval layers within each forward pass, improving over prior prompt-based methods in online continual learning.

Abstract

Continual learning in transformers is commonly addressed through parameter-efficient adaptation: prompts, adapters, or LoRA modules are specialized per task while the backbone remains frozen. Although effective in controlled multi-epoch settings, these approaches rely on gradual gradient-based specialization and struggle in Online Continual Learning (OCL), where data arrive as a non-stationary stream and each sample may be observed only once. We recast continual learning in transformers as a routing problem: under strict online constraints, the model must dynamically select the appropriate representational subspace for each input without explicit task identifiers or repeated optimization. We thus introduce Routing without Forgetting (RwF), a transformer architecture augmented with energy-based associative retrieval layers inspired by Modern Hopfield Networks. Instead of storing or merging task-specific prompts, RwF generates dynamic prompts through single-step associative retrieval over the transformer token embeddings at each layer. Retrieval corresponds to the closed-form minimization of a strictly convex free-energy functional, enabling input-conditioned routing within each forward pass, independently of iterative gradient refinement. Across challenging class-incremental benchmarks, RwF improves over existing prompt-based methods. On Split-ImageNet-R and Split-ImageNet-S, RwF outperforms prior prompt-based approaches by a large margin, even in few-shot learning regimes. These results indicate that embedding energy-based associative routing directly within the transformer backbone provides a principled and effective foundation for OCL.
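To make the retrieval step described in the abstract concrete, here is a minimal sketch of single-step associative retrieval in the style of Modern Hopfield Networks. It is not the authors' implementation: the function names, the toy data, the inverse temperature `beta`, and the choice to use the raw token embeddings as both keys and values (with no learned projections) are all assumptions made for illustration. The sketch relies on the standard fact that softmax weights are the closed-form minimizer of the strictly convex free energy F(p) = -<p, s> + (1/beta) * sum_i p_i log p_i over the probability simplex, so no iterative optimization is needed at retrieval time.

```python
import math

def softmax(scores, beta=1.0):
    """Numerically stable softmax of beta * scores."""
    scaled = [beta * s for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, keys, values, beta=1.0):
    """Single-step associative retrieval: a softmax-weighted sum of values."""
    scores = [dot(query, k) for k in keys]
    weights = softmax(scores, beta)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, weights

def free_energy(weights, scores, beta=1.0):
    """F(p) = -<p, s> + (1/beta) * sum_i p_i log p_i, strictly convex on the simplex."""
    energy = -dot(weights, scores)
    neg_entropy = sum(p * math.log(p) for p in weights if p > 0.0)
    return energy + neg_entropy / beta

# Toy example (hypothetical data): three stored token embeddings and one query.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.9, 0.1]

# Assumption: keys and values are both the token embeddings themselves.
prompt, weights = retrieve(query, tokens, tokens, beta=4.0)
```

In this sketch the retrieved vector `prompt` plays the role of an input-conditioned dynamic prompt: it changes with the query in a single forward computation. A larger `beta` concentrates the weights on the best-matching stored token, while a smaller `beta` blends the stored tokens more evenly.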

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
