The paper addresses catastrophic forgetting in continual learning for VQA tasks using Vision-Language Models (VLMs), whose trainable components are architecturally asymmetric. It identifies that global regularization favors the large language decoder, leading to forgetting in the smaller visual projection layers and a loss of compositional reasoning. To mitigate this, the authors introduce Asymmetric Information Masking (AIM), which applies modality-specific masks based on parameter sensitivity to balance stability and plasticity. AIM achieves state-of-the-art performance on the VQA v2 and GQA benchmarks in continual VQA settings.
VLMs forget visual reasoning skills in continual learning because today's methods over-protect the language decoder while neglecting the visual projection layers.
In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.
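The core mechanism described above, masking parameter updates according to modality-specific sensitivity so the small visual projection is protected more aggressively than the large language decoder, can be illustrated with a minimal sketch. Note that the sensitivity proxy (|parameter × gradient|, a Fisher-style estimate), the per-group protection fractions, and all names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def aim_masks(params, grads, protect_frac):
    """Illustrative sketch of modality-specific masking in the spirit of AIM.

    For each modality group, estimate per-parameter sensitivity and freeze
    the most sensitive fraction by zeroing its gradient mask. The sensitivity
    measure and protection fractions are assumptions for illustration only.
    """
    masks = {}
    for name, p in params.items():
        # Fisher-style sensitivity proxy: |theta * dL/dtheta|
        sensitivity = np.abs(p * grads[name])
        # Modality-specific protection rate: the fraction of parameters
        # in this group whose updates are masked out (frozen).
        thresh = np.quantile(sensitivity, 1.0 - protect_frac[name])
        # mask = 1 -> parameter is free to update, 0 -> protected
        masks[name] = (sensitivity < thresh).astype(p.dtype)
    return masks

rng = np.random.default_rng(0)
params = {"visual_projection": rng.normal(size=100),
          "language_decoder": rng.normal(size=1000)}
grads = {k: rng.normal(size=v.shape) for k, v in params.items()}

# Asymmetric protection: shield the small, vulnerable visual projection
# far more aggressively than the massive language decoder.
masks = aim_masks(params, grads, {"visual_projection": 0.5,
                                  "language_decoder": 0.1})
masked_update = {k: grads[k] * masks[k] for k in grads}
```

Applying `masked_update` in place of the raw gradients leaves the most sensitive visual-projection parameters untouched, which is one way to realize the stability/plasticity trade-off the abstract describes.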