College of Computer Science · Mar 29, 2026 · arXiv:2603.27706

MAR3: Multi-Agent Recognition, Reasoning, and Reflection for Reference Audio-Visual Segmentation

AI Summary

This paper introduces MAR3, a training-free Multi-Agent Recognition, Reasoning, and Reflection framework for Reference Audio-Visual Segmentation (Ref-AVS). MAR3 combines three components: a Consensus Multimodal Recognition mechanism that identifies expression difficulty and the dominant modality, an adaptive Collaborative Object Reasoning strategy driven by those two factors, and a Reflective Learning Segmentation mechanism for iterative mask correction. On Ref-AVSBench, MAR3 reaches 69.2% J&F, an absolute improvement of 3.4% over previous state-of-the-art methods.

Key Contribution

LLM agents can achieve state-of-the-art reference audio-visual segmentation without any training when organized as a multi-agent system that explicitly reasons about expression difficulty and validates segmentation results.

Abstract

Reference Audio-Visual Segmentation (Ref-AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and dominant modality in multimodal cues, over-rely on the quality of the instruction-tuning dataset for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions. To address these issues, we propose MAR3, a novel training-free Multi-Agent Recognition, Reasoning, and Reflection framework for high-quality Ref-AVS. Drawing on the sociological Delphi theory for robust analysis, we propose a Consensus Multimodal Recognition mechanism that enables LLM agents to explicitly recognize the difficulty of reference expressions and the dominant modality of multimodal cues. Based on our modality-dominant difficulty rule, we propose an adaptive Collaborative Object Reasoning strategy to reliably reason about the referred object. To further ensure precise mask prediction, we develop a Reflective Learning Segmentation mechanism, in which a check agent examines intermediate segmentation results and iteratively corrects the object text prompt of the segment agent. Experiments demonstrate that MAR3 achieves superior performance (69.2% in J&F) on the Ref-AVSBench dataset, outperforming the previous state of the art by an absolute 3.4%.
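The reflective loop described in the abstract, where a check agent inspects an intermediate mask and revises the segment agent's text prompt, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `reflective_segmentation`, `CheckResult`, `segment`, `check`, and `max_rounds` are hypothetical, the agents are plain callables standing in for LLM and segmentation models, and the stopping rule (accept, or give up after a fixed number of rounds) is an assumption the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class CheckResult:
    """Verdict of the (hypothetical) check agent on one intermediate mask."""
    ok: bool                              # True if the mask is accepted
    revised_prompt: Optional[str] = None  # corrected object text prompt, if any


def reflective_segmentation(
    initial_prompt: str,
    segment: Callable[[str], Any],             # stand-in for the segment agent
    check: Callable[[str, Any], CheckResult],  # stand-in for the check agent
    max_rounds: int = 3,                       # assumed cap; not from the paper
) -> Tuple[str, Any]:
    """Segment, let the checker inspect the mask, and retry with a revised prompt."""
    prompt = initial_prompt
    mask = segment(prompt)
    for _ in range(max_rounds):
        result = check(prompt, mask)
        # Stop when the checker accepts the mask or offers no correction.
        if result.ok or not result.revised_prompt:
            break
        prompt = result.revised_prompt
        mask = segment(prompt)
    return prompt, mask


# Toy stand-ins so the sketch runs without any model.
def toy_segment(prompt: str) -> str:
    return f"mask<{prompt}>"


def toy_check(prompt: str, mask: str) -> CheckResult:
    if "guitar" in prompt:
        return CheckResult(ok=True)
    return CheckResult(ok=False, revised_prompt="the guitar being played")


final_prompt, final_mask = reflective_segmentation(
    "the instrument", toy_segment, toy_check
)
```

In this toy run the checker rejects the vague prompt once, supplies a more specific one, and accepts the second mask. In a real system the two callables would wrap a segmentation model and an LLM-based verifier.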

Computer Vision · Multimodal Models · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
