Tsinghua AI · Great Bay University · International Graduate School · Nankai University · UTokyo
Jun 8, 2026 · arXiv:2606.09641

MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding

Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang, Fei Luo

AI Summary

This paper introduces MAVIS, a multi-agent framework that recasts video retrieval as cooperative reasoning rather than brute-force search, targeting the computational cost of embedding-based full-corpus scanning. MAVIS parses videos into a Structured Semantic Library indexed at the attribute level. At query time, a planner splits the user intent into sub-tasks, specialized agents nominate candidates, and a Logic-aware Debate mechanism with a veto protocol prunes mismatches before fine-grained verification. On MSR-VTT, MSVD, and ActivityNet, MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to dual-encoder approaches.

Key Contribution

MAVIS reframes video retrieval as collaborative reasoning among agents that nominate and prune candidates, achieving competitive performance with traditional methods without task-specific fine-tuning.

Abstract

The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce MAVIS, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first bridges the granularity mismatch by parsing raw videos into a Structured Semantic Library, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks, dispatching specialized agents to independently nominate candidates. Crucially, MAVIS employs a Logic-aware Debate mechanism with a strict veto protocol, where agents collaboratively prune logical mismatches to identify a compact set of "controversial" candidates for fine-grained verification. This agentic workflow effectively bypasses the inefficiency of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.
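The workflow described in the abstract (attribute-level index, planner, nominating agents, veto-based pruning) can be illustrated with a toy sketch. This is not the authors' implementation. The library contents, the attribute names (`object`, `action`, `scene`), and every function name below are hypothetical. The planner is a stand-in that receives an already structured intent, and the debate is reduced to a rule-based veto, where a real system would presumably rely on model-based judgments.

```python
from collections import defaultdict

# Toy "Structured Semantic Library": each video is parsed into attribute sets.
# All attribute names and values are illustrative assumptions.
LIBRARY = {
    "v1": {"object": {"dog", "ball"}, "action": {"running"}, "scene": {"park"}},
    "v2": {"object": {"dog"}, "action": {"sleeping"}, "scene": {"living room"}},
    "v3": {"object": {"cat", "ball"}, "action": {"running"}, "scene": {"park"}},
    "v4": {"object": {"dog", "ball"}, "action": {"running"}, "scene": {"beach"}},
}


def build_index(library):
    """Inverted index: (attribute, value) -> set of video ids."""
    index = defaultdict(set)
    for vid, attrs in library.items():
        for attr, values in attrs.items():
            for value in values:
                index[(attr, value)].add(vid)
    return index


def plan(intent):
    """Stand-in planner: the intent is already a dict of attribute -> value,
    so decomposition just lists its items as atomic sub-tasks."""
    return sorted(intent.items())


def nominate(index, subtask):
    """One specialized agent: look up candidates for its single sub-task."""
    return set(index.get(subtask, set()))


def debate(library, subtasks, nominations):
    """Strict veto (assumed rule): a candidate nominated by any agent is
    dropped as soon as one agent finds it contradicts that agent's sub-task."""
    pool = set().union(*nominations)
    survivors = set()
    for vid in pool:
        vetoed = any(value not in library[vid].get(attr, set())
                     for attr, value in subtasks)
        if not vetoed:
            survivors.add(vid)
    return survivors


def retrieve(library, intent):
    """Index lookup, nomination, then veto; only nominated videos are inspected."""
    index = build_index(library)
    subtasks = plan(intent)
    nominations = [nominate(index, s) for s in subtasks]
    return sorted(debate(library, subtasks, nominations))


print(retrieve(LIBRARY, {"object": "dog", "action": "running", "scene": "park"}))
```

With purely rule-based agents, the strict veto collapses to a set intersection over the nominations. The sketch only shows the control flow: candidates come from index lookups rather than a scan of the whole library, and the surviving set is what would be passed to a finer verification step.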

Computer Vision · Multimodal Models · Recommendation & Information Retrieval

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
