This paper introduces Attention-Spectrum Regularization (ASR), a replay-free continual learning framework for multimodal large language models (MLLMs) that targets catastrophic forgetting during adaptation to new tasks. ASR treats cross-attention maps as two-dimensional signals and summarizes their scale and directional properties into compact spectral statistics, preserving skill-conditioned attention structure without replaying past data. On continual VQA and multimodal instruction-tuning benchmarks, the authors report that ASR improves final performance and reduces forgetting relative to replay-, regularization-, and adapter-based baselines.
Preserving skill-level attention structures in MLLMs can dramatically reduce forgetting while adapting to new tasks without relying on replay mechanisms.
Multimodal large language models (MLLMs) are increasingly required to adapt to non-stationary streams of visual domains, question types, and user instructions, yet continual fine-tuning often causes severe forgetting of previously acquired multimodal skills. Existing continual vision-language methods mainly preserve outputs, replay data or pseudo-data, regularize embedding geometry, or allocate task-specific parameters, but they provide limited control over how internal cross-modal attention patterns supporting old skills drift during adaptation. We propose Attention-Spectrum Regularization (ASR), a replay-free continual learning framework that preserves skill-conditioned structures of cross-modal attention. ASR treats cross-attention maps as two-dimensional signals, summarizes their scale and directional properties into compact spectral statistics, and stores only skill-wise prototype distributions instead of replaying past image-question pairs, generated pseudo-examples, or old-stage teacher snapshots. In later stages, a phase-invariant spectral regularizer constrains harmful drift of these prototypes while allowing instance-level attention to adapt to new tasks. We provide theoretical analysis showing that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption, and that Fourier power spectra are stable to spatial translations and bounded perturbations. Experiments on continual VQA and multimodal instruction-tuning benchmarks, including VQA v2, VQACL, CLT-VQA, CoIN, and UCIT, show that ASR consistently improves final performance and reduces forgetting over strong replay-, regularization-, and adapter-based baselines. Preserving skill-level attention structure is an effective and lightweight mechanism for continual MLLMs. Code is available at https://github.com/Creative-zcx/attention-spectrum-replay
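The spectral summary and drift penalty described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`spectral_stats`, `skill_prototype`, `spectral_drift`), the radial/angular histogram binning, the use of a mean vector as the prototype, and the squared-distance penalty are all assumptions chosen for illustration. The sketch only shows why a power-spectrum summary is insensitive to (circular) spatial translation of an attention map, since translation changes the Fourier phase but not the magnitude.

```python
import numpy as np


def spectral_stats(attn, n_radial=4, n_angular=4, eps=1e-12):
    """Summarize a 2D attention map as radial (scale) and angular
    (direction) energy histograms of its Fourier power spectrum.

    Hypothetical helper: the binning scheme is an assumption.
    """
    attn = np.asarray(attn, dtype=np.float64)
    h, w = attn.shape
    # Power spectrum discards phase; subtracting the mean removes the DC term.
    power = np.abs(np.fft.fft2(attn - attn.mean())) ** 2
    power = power / (power.sum() + eps)

    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies, shape (h, 1)
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies, shape (1, w)
    radius = np.sqrt(fy ** 2 + fx ** 2)             # spatial scale, (h, w)
    angle = np.mod(np.arctan2(fy, fx), np.pi)       # orientation in [0, pi]

    radial, _ = np.histogram(radius.ravel(), bins=n_radial,
                             range=(0.0, radius.max()), weights=power.ravel())
    angular, _ = np.histogram(angle.ravel(), bins=n_angular,
                              range=(0.0, np.pi), weights=power.ravel())
    return np.concatenate([radial, angular])


def skill_prototype(attn_maps):
    """Average the spectral statistics over all maps of one skill
    (assumed prototype form: a single mean vector)."""
    return np.mean([spectral_stats(a) for a in attn_maps], axis=0)


def spectral_drift(attn_maps, prototype):
    """Squared distance between current statistics and a stored prototype
    (assumed penalty; the paper's regularizer may differ)."""
    current = skill_prototype(attn_maps)
    return float(np.sum((current - prototype) ** 2))


rng = np.random.default_rng(0)
old_maps = rng.random((8, 16, 16))            # stand-in attention maps
proto = skill_prototype(old_maps)

# Circularly shifting every map leaves the power spectrum unchanged,
# so the drift is zero up to floating-point error.
shifted = np.roll(old_maps, shift=(3, 5), axis=(1, 2))
print(spectral_drift(shifted, proto))

# Maps with different structure produce a nonzero drift.
new_maps = rng.random((8, 16, 16)) ** 4
print(spectral_drift(new_maps, proto))
```

Only the prototype vector needs to be stored per skill in this sketch, which mirrors the abstract's point that no past image-question pairs or teacher snapshots are kept.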