Mar 19, 2026 · arXiv:2603.18641

A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems

AI Summary

This paper benchmarks catastrophic forgetting mitigation strategies for continual intent classification on the CLINC150 dataset in a 10-task label-disjoint setting. The authors evaluate ANN, GRU, and Transformer architectures with replay-based (MIR), regularization-based (LwF), and parameter-isolation (HAT) continual learning methods, both individually and in combination. Results show that replay, especially MIR, is the key ingredient for mitigating forgetting and that the optimal CL configuration is architecture-dependent; in several cases a CL configuration even surpasses joint training.
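A 10-task label-disjoint setting can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the random seed, the placeholder intent names, and the assumption that the 150 in-scope CLINC150 intents are split evenly (15 per task) are all hypothetical, and the actual class ordering used in the paper is not given here.

```python
import random


def make_label_disjoint_tasks(labels, num_tasks=10, seed=0):
    """Partition the unique labels into num_tasks disjoint groups.

    Each label is assigned to exactly one task, so no two tasks
    share a class (the "label-disjoint" property).
    """
    classes = sorted(set(labels))
    rng = random.Random(seed)  # fixed seed for a reproducible task order
    rng.shuffle(classes)
    # Strided slicing gives near-equal group sizes.
    return [classes[i::num_tasks] for i in range(num_tasks)]


def split_examples(examples, tasks):
    """Group (text, label) pairs by the task that owns their label."""
    owner = {label: t for t, group in enumerate(tasks) for label in group}
    per_task = [[] for _ in tasks]
    for text, label in examples:
        per_task[owner[label]].append((text, label))
    return per_task


# Hypothetical stand-in for the 150 in-scope CLINC150 intent names.
intent_names = [f"intent_{i:03d}" for i in range(150)]
tasks = make_label_disjoint_tasks(intent_names)

# Toy examples: one utterance per intent, routed to its owning task.
examples = [(f"utterance for {name}", name) for name in intent_names]
task_data = split_examples(examples, tasks)
```

The model would then be trained on `task_data[0]`, `task_data[1]`, and so on in sequence, which is the regime in which naive fine-tuning forgets earlier intents.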

Key Contribution

Naive sequential fine-tuning causes severe catastrophic forgetting, whereas combinations that include replay (for example MIR with the parameter-isolation method HAT) largely prevent it and in several cases *surpass* joint training on continual intent classification.

Abstract

Neural language models deployed in real-world applications must continually adapt to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task label-disjoint scenario and evaluate three backbone architectures: a feed-forward Artificial Neural Network (ANN), a Gated Recurrent Unit (GRU), and a Transformer encoder, under a range of continual learning (CL) strategies. We consider one representative method from each major CL family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter-isolation via Hard Attention to Task (HAT), both individually and in all pairwise and triple combinations. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence. Our results show that naive sequential fine-tuning suffers from severe forgetting for all architectures and that no single CL method fully prevents it. Replay emerges as a key ingredient: MIR is the most reliable individual strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently achieve high final performance with near-zero or mildly positive backward transfer. The optimal configuration is architecture-dependent: MIR+HAT yields the best result for ANN and Transformer, whereas MIR+LwF+HAT works best for GRU. In several cases CL methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly selecting backbone architecture and CL mechanism when designing continual intent-classification systems.
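The three reported metrics can be sketched in a few lines. This sketch assumes the commonly used definitions, which the abstract does not spell out: `R[i][j]` is the accuracy on task `j` after training through task `i`, average accuracy is the mean of the final row, and backward transfer is the mean change on earlier tasks between the moment each was learned and the end of the sequence. The function names and the toy numbers are illustrative only and are not results from the paper.

```python
def average_accuracy(R):
    """Mean accuracy over all tasks after training on the last task."""
    T = len(R)
    return sum(R[T - 1]) / T


def backward_transfer(R):
    """Mean of R[T-1][j] - R[j][j] over all tasks j before the last.

    Negative values indicate forgetting; values near zero or above
    indicate that earlier tasks were retained or even improved.
    """
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes that appear."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy 3-task accuracy matrix (made-up numbers showing forgetting).
R = [
    [0.9, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.5, 0.7, 0.9],
]
acc = average_accuracy(R)
bwt = backward_transfer(R)
f1 = macro_f1(["a", "a", "b"], ["a", "b", "b"])
```

In the toy matrix the accuracy on the first task falls from 0.9 to 0.5, so backward transfer is negative; the "near-zero or mildly positive" values reported for replay-based combinations correspond to a final row that matches or exceeds the diagonal.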

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
