This paper introduces a new exemplar-free continual learning benchmark for audio-visual segmentation (AVS) to address the challenge of evolving audio and visual distributions in real-world environments. The benchmark includes four learning protocols across single-source and multi-source AVS datasets. The authors also propose a strong baseline model, ATLAS, which uses audio-guided pre-fusion conditioning and mitigates catastrophic forgetting with Low-Rank Anchoring (LRA).
A new benchmark reveals how existing audio-visual segmentation models crumble when faced with the dynamic, ever-changing audio and visual environments of the real world.
Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound-producing objects in videos by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic, causing audio and visual distributions to evolve over time, which challenges existing AVS systems that assume static training settings. To address this gap, we introduce the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets. We further propose a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention. Finally, we mitigate catastrophic forgetting by introducing Low-Rank Anchoring (LRA), which stabilizes adapted weights based on loss sensitivity. Extensive experiments demonstrate competitive performance across diverse continual learning scenarios, establishing a foundation for lifelong audio-visual perception. Code is available at https://gitlab.com/viper-purdue/atlas (paper under review).
Keywords: Continual Learning, Audio-Visual Segmentation, Multi-Modal Learning
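To make the two mechanisms named in the abstract concrete, here is a minimal PyTorch sketch. It assumes a FiLM-style channel modulation for the audio-guided pre-fusion conditioning and an EWC-style sensitivity-weighted penalty for Low-Rank Anchoring; all module names, shapes, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class AudioGuidedPreFusion(nn.Module):
    """Sketch: project an audio embedding into per-channel scale/shift
    terms that modulate visual features before cross-modal attention.
    (Hypothetical design, not the released ATLAS code.)"""

    def __init__(self, audio_dim: int, visual_channels: int):
        super().__init__()
        # Projected audio context -> (gamma, beta) per visual channel.
        self.to_scale_shift = nn.Linear(audio_dim, 2 * visual_channels)

    def forward(self, visual_feats: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, C, H, W); audio_emb: (B, audio_dim)
        gamma, beta = self.to_scale_shift(audio_emb).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        # Channel-wise conditioning applied before cross-modal attention.
        return visual_feats * (1.0 + gamma) + beta


def low_rank_anchoring_penalty(adapted: torch.Tensor,
                               anchored: torch.Tensor,
                               sensitivity: torch.Tensor) -> torch.Tensor:
    """Sketch of a loss-sensitivity-weighted anchoring regularizer:
    adapted low-rank weights are pulled toward their values from the
    previous task, more strongly where the loss is more sensitive.
    (Assumed formulation for illustration only.)"""
    return (sensitivity * (adapted - anchored) ** 2).sum()
```

As a usage sketch, the penalty would be scaled by a coefficient and added to the segmentation loss at each continual learning step, while `AudioGuidedPreFusion` would sit between the visual backbone and the cross-modal attention layers.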