Department of Computer Science | NJU | Virginia Tech | May 27, 2026 | arXiv:2605.28809

AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning

AI Summary

This paper tackles catastrophic forgetting in CLIP-based Class-Incremental Learning (CIL) by explicitly modeling attribute extraction and aggregation as distinct stages. The authors propose AREA, which stabilizes attribute extraction by anchoring class-level visual and textual attributes using principal geodesic analysis on a hypersphere. AREA also stabilizes aggregation by learning task-specific experts with variational information bottleneck regularization, and it uses optimal transport for routing during inference. The paper reports that AREA outperforms prior state-of-the-art methods.
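To make the idea of anchoring attributes on a hypersphere more concrete, here is a minimal sketch of tangent-space PCA, a common linearized approximation of principal geodesic analysis. It is not the authors' implementation: the function names (`log_map`, `exp_map`, `frechet_mean`, `tangent_pca`) and the toy data are assumptions made for illustration. The sketch finds a mean direction on the unit sphere, maps the points to the tangent space at that mean, and takes the leading principal directions there.

```python
import numpy as np

def normalize(x):
    """Project vectors onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_map(base, points):
    """Map unit vectors `points` (n, d) to the tangent space at unit vector `base` (d,)."""
    cos = np.clip(points @ base, -1.0, 1.0)
    theta = np.arccos(cos)                    # geodesic distance to the base point
    ortho = points - cos[:, None] * base      # component orthogonal to the base point
    norm = np.linalg.norm(ortho, axis=1)
    scale = np.where(norm > 1e-12, theta / np.maximum(norm, 1e-12), 0.0)
    return ortho * scale[:, None]

def exp_map(base, tangent):
    """Map one tangent vector at `base` back onto the sphere."""
    norm = np.linalg.norm(tangent)
    if norm < 1e-12:
        return base
    return np.cos(norm) * base + np.sin(norm) * tangent / norm

def frechet_mean(points, iters=20):
    """Iteratively estimate the mean direction on the sphere."""
    mean = normalize(points.mean(axis=0))
    for _ in range(iters):
        mean = exp_map(mean, log_map(mean, points).mean(axis=0))
    return mean

def tangent_pca(points, k):
    """Return the mean, the top-k tangent directions, and their variances."""
    mean = frechet_mean(points)
    tangent = log_map(mean, points)
    _, s, vt = np.linalg.svd(tangent, full_matrices=False)
    return mean, vt[:k], s[:k] ** 2 / len(points)

# Toy data (hypothetical): 50 unit vectors scattered around the north pole.
rng = np.random.default_rng(0)
pts = normalize(np.array([0.0, 0.0, 1.0]) + 0.1 * rng.normal(size=(50, 3)))
mean, directions, variances = tangent_pca(pts, k=2)
```

The returned directions lie in the tangent space at the mean, so they are orthogonal to it. How AREA actually uses such directions as class-level anchors is described in the paper, not here.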

Key Contribution

Decomposing CLIP-based classification into attribute extraction and aggregation reveals a new path to mitigating catastrophic forgetting in class-incremental learning.

Abstract

Class-Incremental Learning (CIL) is important in building real-world learning systems. In CLIP-based CIL, the model performs classification by comparing the similarity between visual and textual embeddings obtained from template prompts, e.g., "a photo of a [CLASS]". This seemingly monolithic matching process can be decomposed into two conceptually distinct stages: attribute extraction and attribute aggregation. For example, a model may recognize a cat using attributes such as fur texture and whiskers. When learning a new class such as car, the model must extract additional attributes such as wheels and adjust how they are aggregated in the shared representation space. However, since only data from the current task is available, incremental updates can bias both attribute extraction and aggregation toward new classes, leading to catastrophic forgetting. Therefore, we propose AREA for attribute extraction and aggregation in CLIP-based CIL. To stabilize extraction, we anchor class-level visual and textual attributes on the hyperspherical embedding space via principal geodesic analysis. To stabilize aggregation, we learn lightweight task-specific experts with scoring and residual refinement, regularized by a variational information bottleneck objective. During inference, we perform routing over task attribute manifolds via optimal transport for more concise prediction. Experiments show that AREA consistently outperforms SOTA methods. Code is available at https://github.com/LAMDA-CL/ICML2026-AREA.
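The baseline matching step described in the abstract, comparing an image embedding against text embeddings of template prompts, can be sketched as follows. This is a minimal illustration under stated assumptions, not the AREA method: the embeddings are hand-written toy vectors standing in for real encoder outputs, and the helper names (`l2_normalize`, `classify_by_similarity`) and the `temperature` value are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def classify_by_similarity(image_emb, text_embs, temperature=0.01):
    """Pick the class whose text embedding is most similar to the image embedding.

    image_emb: shape (d,); text_embs: shape (num_classes, d).
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_embs)
    logits = txt @ img / temperature          # scaled cosine similarities
    logits = logits - logits.max()            # numerical stability for softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

# Template prompts, one per class, as in the abstract's example.
class_names = ["cat", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

# Toy stand-ins for text and image encoder outputs (assumed values).
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
image_emb = np.array([0.9, 0.1, 0.2])

pred, probs = classify_by_similarity(image_emb, text_embs)
print(class_names[pred])
```

In a real pipeline the toy vectors would be replaced by the outputs of the CLIP text encoder applied to `prompts` and the CLIP image encoder applied to an input image.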

Computer Vision, Multimodal Models, Training Efficiency & Optimization

