This paper addresses Class-Incremental Learning (CIL) in the audio-visual domain by integrating the rich static priors of the SAM-Audio model into a continual learning framework. The authors observe that SAM-Audio's representations struggle when used in incremental settings. To address this, they introduce a guided attention strategy in which audio features contextually guide the visual representations, together with dual-level distillation objectives at the feature and logit levels to combat catastrophic forgetting. Evaluations on audio-visual CIL benchmarks show that the approach consistently outperforms existing state-of-the-art methods.
Letting audio features guide SAM-Audio's visual representations, combined with feature- and logit-level distillation, improves audio-visual Class-Incremental Learning and mitigates catastrophic forgetting.
Class-Incremental Learning (CIL) aims to continuously learn new classes without forgetting previously acquired knowledge. While recent CIL advances have spurred significant interest across various modalities, the audio-visual setting remains underexplored. Furthermore, although foundational multimodal models like SAM-Audio encapsulate rich static priors, our empirical analysis reveals that these representations struggle in incremental settings. This work bridges this gap by integrating SAM-Audio's audio-visual priors into the CIL setting. Specifically, we leverage its dense audio and visual representations and employ a novel guided attention strategy where the audio features contextually guide the visual representations. To further mitigate catastrophic forgetting, we introduce dual-level distillation objectives at both the feature and logit levels. Extensive evaluations on audio-visual CIL benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods.
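The abstract does not specify the exact form of the guided attention or the distillation losses, so the following is only a minimal sketch of the two ideas under stated assumptions: the guidance is modeled as scaled dot-product cross-attention with a residual connection (visual tokens as queries, audio tokens as keys and values), the feature-level objective as a mean squared error, and the logit-level objective as a temperature-softened KL divergence over the old classes. All function names, parameters, and loss weights are hypothetical and are not taken from the paper.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(visual_tokens, audio_tokens):
    """Assumed form of guidance: each visual token attends over the audio
    tokens, and the attended audio context is added back residually."""
    dim = len(visual_tokens[0])
    guided = []
    for v in visual_tokens:
        # Scaled dot-product score between this visual query and each audio key.
        scores = [
            sum(vi * ai for vi, ai in zip(v, a)) / math.sqrt(dim)
            for a in audio_tokens
        ]
        weights = softmax(scores)
        # Weighted sum of audio tokens, used here as the values.
        context = [
            sum(w * a[d] for w, a in zip(weights, audio_tokens))
            for d in range(dim)
        ]
        guided.append([vi + ci for vi, ci in zip(v, context)])
    return guided


def feature_distillation(old_features, new_features):
    """Feature-level term (assumed MSE between old and new model features)."""
    squared = [(o - n) ** 2 for o, n in zip(old_features, new_features)]
    return sum(squared) / len(squared)


def logit_distillation(old_logits, new_logits, temperature=2.0):
    """Logit-level term (assumed KL divergence on the old classes only).

    The new model may have more outputs than the old one, so its logits
    are truncated to the old class range before comparison."""
    num_old = len(old_logits)
    p = softmax(old_logits, temperature)
    q = softmax(new_logits[:num_old], temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def dual_level_distillation(old_features, new_features, old_logits, new_logits,
                            feature_weight=1.0, logit_weight=1.0, temperature=2.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (
        feature_weight * feature_distillation(old_features, new_features)
        + logit_weight * logit_distillation(old_logits, new_logits, temperature)
    )
```

In a training loop of this kind, the old features and logits would come from a frozen copy of the model saved before the current incremental step, and the combined term would be added to the classification loss on the new classes. The sketch omits learned projections, multiple heads, and batching, all of which a practical implementation would likely need.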