Fudan | Mar 4, 2026 | arXiv:2603.04128

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Dong Cai, Dongnuan Cai, Henghui Du, Changda Zhou, Chang Zhou, Xi Chen, Dan Guo, Hongyuan Zhang, Xuelong Li, Di Hu

AI Summary

The authors introduce Crab$^{+}$, a scalable Audio-Visual Large Language Model (AV-LLM) designed for unified scene understanding, addressing the problem of negative transfer in multi-task learning. They create AV-UIE v2, a large instruction-tuning dataset with explicit reasoning processes, and propose Interaction-aware LoRA (I-LoRA) to model inter-task relationships and mitigate parameter interference. Experiments demonstrate that Crab$^{+}$ achieves positive transfer in nearly 88% of tasks, outperforming specialized models and reversing the negative transfer trend observed in conventional multi-task unification methods.

Key Contribution

Multi-task AV-LLMs can actually *improve* performance over single-task models, if you carefully design the training data and explicitly model inter-task relationships to avoid negative transfer.
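The positive and negative transfer figures quoted on this page (nearly 88% and nearly 55% of tasks) are fractions of tasks whose multi-task score ends up above or below the single-task baseline. A minimal sketch of that bookkeeping follows. It assumes one scalar score per task where higher is better. The function name `transfer_rates` and all scores are hypothetical, invented for illustration, and not taken from the paper.

```python
def transfer_rates(single_task, multi_task):
    """Return (positive, negative) transfer rates over the shared tasks.

    Assumption: each dict maps a task name to one score, higher is better.
    Ties count as neither positive nor negative transfer.
    """
    tasks = sorted(single_task.keys() & multi_task.keys())
    if not tasks:
        raise ValueError("no tasks in common")
    improved = sum(multi_task[t] > single_task[t] for t in tasks)
    degraded = sum(multi_task[t] < single_task[t] for t in tasks)
    return improved / len(tasks), degraded / len(tasks)


# Hypothetical scores, invented purely for illustration (not from the paper).
single = {"task_a": 70.0, "task_b": 55.0, "task_c": 81.0, "task_d": 62.0}
multi = {"task_a": 72.5, "task_b": 54.0, "task_c": 83.0, "task_d": 62.0}

pos, neg = transfer_rates(single, multi)
print(f"positive transfer: {pos:.0%}, negative transfer: {neg:.0%}")
```

In this toy case two of four tasks improve and one degrades, so the two rates need not sum to one when a task ties its baseline.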

Abstract

Developing Audio-Visual Large Language Models (AV-LLMs) for unified scene understanding is pivotal in multimodal intelligence. While instruction tuning endows pre-trained models with multi-task abilities, we observe that conventional multi-task unification methods often suffer from severe negative transfer, where nearly 55% of tasks degrade compared to single-task training. We attribute this phenomenon to audio-visual task heterogeneity, characterized by disparate task granularity and divergent capability demands, which lead to negative interference under joint training. To tackle this, we present Crab$^{+}$, a scalable and unified audio-visual scene understanding model that addresses task heterogeneity through explicit cooperation from both data and model perspectives. On the data side, we introduce AV-UIE v2, a comprehensive Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning processes. It contains approximately 222K samples spanning 17 datasets and 7 tasks, enabling the model to capture cross-task relationships at different levels of granularity. On the model side, we design a unified interface to align heterogeneous task formulations, and propose Interaction-aware LoRA (I-LoRA), which explicitly models inter-task relationships via dynamic routing to coordinate distinct audio-visual interaction patterns, mitigating parameter interference. Extensive experiments show Crab$^{+}$ covers broader tasks than existing unified models while outperforming specialized models on various benchmarks. We successfully reverse the negative transfer trend, achieving positive transfer where multi-task learning surpasses single-task baselines in nearly 88% of tasks. These results hold across diverse AV-LLM paradigms and are validated through in-depth visualization, positioning Crab$^{+}$ as a robust step towards holistic audio-visual scene understanding.
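The abstract does not spell out how I-LoRA is built, so the following is not the authors' implementation. It is a generic sketch of the broader idea of routing an input across several LoRA adapters: a frozen base projection plus a softmax-gated sum of low-rank updates. The class name `RoutedLoRALinear`, the softmax router, the dimensions, and the random initialization are all assumptions made for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class RoutedLoRALinear:
    """Hypothetical layer: y = W x + sum_k g_k(x) * B_k (A_k x)."""

    def __init__(self, d_in, d_out, rank, num_heads, seed=0):
        rng = random.Random(seed)
        self.base = rand_matrix(d_out, d_in, rng)        # W, treated as frozen
        self.router = rand_matrix(num_heads, d_in, rng)  # produces gate logits
        self.down = [rand_matrix(rank, d_in, rng) for _ in range(num_heads)]  # A_k
        # Standard LoRA zero-initializes the up-projection; small random values
        # are used here only so the toy output differs from the base output.
        self.up = [rand_matrix(d_out, rank, rng) for _ in range(num_heads)]   # B_k

    def gate(self, x):
        """Input-dependent routing weights over the adapters (sum to 1)."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        y = matvec(self.base, x)
        for g, a, b in zip(self.gate(x), self.down, self.up):
            delta = matvec(b, matvec(a, x))  # low-rank update B_k (A_k x)
            y = [yi + g * di for yi, di in zip(y, delta)]
        return y


layer = RoutedLoRALinear(d_in=4, d_out=3, rank=2, num_heads=3)
x = [0.5, -1.0, 0.25, 2.0]
gates = layer.gate(x)
y = layer.forward(x)
print("gates:", gates)
print("output:", y)
```

Because the gates depend on the input, different inputs can weight the adapters differently, which is one plausible way to keep dissimilar tasks from overwriting the same low-rank parameters. How Crab$^{+}$ actually defines its routing and interaction patterns is described in the paper itself.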

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 85

Year: 2026

Venue: N/A
