Mar 12, 2026 | arXiv:2603.12055

Continual Learning with Vision-Language Models via Semantic-Geometry Preservation

Chiyuan He, Zihuan Qiu, Fanman Meng, Runtong Zhang, Linfeng Xu, Qingbo Wu, Hongliang Li

AI Summary

The paper introduces Semantic Geometry Preservation for Continual Learning (SeGP-CL), a novel approach to mitigate catastrophic forgetting in vision-language models (VLMs) during continual learning. SeGP-CL identifies drift-prone regions using adversarial anchors generated via dual-targeted projected gradient descent (DPGD) and preserves cross-modal structure through anchor-guided cross-modal geometry distillation (ACGD) and text semantic-geometry regularization (TSGR). Experiments on five continual learning benchmarks demonstrate that SeGP-CL achieves state-of-the-art performance by improving stability, forward transfer, and semantic geometry preservation.

Key Contribution

VLMs forget less when learning new tasks if you explicitly preserve the semantic relationships they learned during pre-training, especially in vulnerable regions identified with adversarial anchors.
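One way to make "preserving semantic relationships" concrete is to compare the pairwise similarity structure of a set of embeddings before and after an update. The sketch below is an illustrative assumption, not the paper's loss: the function names (`similarity_matrix`, `geometry_drift`) and the mean-squared-difference penalty are hypothetical choices made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(embeddings):
    """Pairwise cosine similarities: a simple stand-in for 'semantic geometry'."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]

def geometry_drift(old_embeddings, new_embeddings):
    """Mean squared difference between the old and new similarity matrices.

    Hypothetical penalty for illustration; the paper's ACGD and TSGR terms
    are not specified in the abstract and may differ.
    """
    old_sim = similarity_matrix(old_embeddings)
    new_sim = similarity_matrix(new_embeddings)
    n = len(old_sim)
    total = sum(
        (old_sim[i][j] - new_sim[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)

# Toy embeddings (made-up values).
old = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]   # every vector rotated by 90 degrees
distorted = [[1.0, 0.0], [1.0, 0.1], [1.0, 1.0]]   # second vector pulled toward the first

print(geometry_drift(old, rotated))    # close to zero: relative structure is unchanged
print(geometry_drift(old, distorted))  # positive: relative structure has changed
```

A penalty of this kind ignores global rotations of the embedding space and reacts only when relationships between classes change, which is the sense of "geometry" used informally here.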

Abstract

Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, allowing new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure by anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues. Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving semantic geometry of VLMs.
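The anchor-construction step builds on projected gradient descent. As a rough illustration of the general idea only, the sketch below runs a plain single-target PGD on a toy linear encoder: it nudges a new-task seed toward an old-class target embedding while keeping the perturbation inside an L-infinity ball around the seed. The abstract does not specify the dual-target objective, so everything here (the linear encoder `W`, the dot-product objective, `epsilon`, `step`, `iters`) is an assumption made for the example rather than the paper's DPGD.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def encode(W, x):
    """Toy linear encoder (assumption): one output coordinate per row of W."""
    return [dot(row, x) for row in W]

def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def targeted_pgd(x_seed, W, target, epsilon=0.1, step=0.02, iters=20):
    """Raise dot(encode(W, x), target) while staying within epsilon of x_seed.

    For a linear encoder the gradient of the objective with respect to x
    is W^T target, which does not depend on x, so it is computed once.
    """
    grad = [
        sum(W[i][j] * target[i] for i in range(len(W)))
        for j in range(len(x_seed))
    ]
    x = list(x_seed)
    for _ in range(iters):
        # Signed gradient ascent step toward the target embedding.
        x = [xj + step * sign(gj) for xj, gj in zip(x, grad)]
        # Projection: clip each coordinate back into the L-infinity ball.
        x = [min(max(xj, sj - epsilon), sj + epsilon) for xj, sj in zip(x, x_seed)]
    return x

# Toy setup (made-up values): identity encoder, seed aligned with a "new" class,
# target aligned with an "old" class.
W = [[1.0, 0.0], [0.0, 1.0]]
seed = [1.0, 0.0]
old_class_target = [0.0, 1.0]

anchor = targeted_pgd(seed, W, old_class_target)
print(anchor)
print(dot(encode(W, seed), old_class_target), dot(encode(W, anchor), old_class_target))
```

The projection step is what keeps the anchor "faithful in raw visual space" in this simplified setting: the embedding moves toward the old-class target, but no input coordinate moves more than `epsilon` from the seed.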

Computer Vision | Multimodal Models | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 69

Year: 2026

Venue: N/A
