This paper introduces MAVIS, a multi-agent framework that recasts video retrieval as cooperative reasoning rather than brute-force search, addressing the inefficiency of embedding-based full-corpus scanning. MAVIS builds a Structured Semantic Library that indexes videos at the attribute level and uses a Logic-aware Debate mechanism in which agents collaboratively prune candidates that conflict with the user's intent. Experiments on MSR-VTT, MSVD, and ActivityNet show that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to dual-encoder retrieval.
MAVIS recasts video retrieval as cooperative reasoning among agents that nominate and prune candidates, achieving competitive performance without task-specific fine-tuning.
The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and from the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce \textbf{MAVIS}, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first addresses the granularity mismatch by parsing raw videos into a \textbf{Structured Semantic Library}, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks and dispatches specialized agents to independently nominate candidates. Crucially, MAVIS employs a \textbf{Logic-aware Debate} mechanism with a strict veto protocol, in which agents collaboratively prune logical mismatches to identify a compact set of ``controversial'' candidates for fine-grained verification. This agentic workflow avoids the cost of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.
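The workflow described in the abstract (attribute-level indexing, per-sub-task nomination, veto-based pruning, and a small "controversial" set left for verification) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify data structures or agent logic, so the library contents, the `(attribute, value)` form of a sub-task, and the helper names `build_index`, `nominate`, `vetoes`, and `retrieve` are all assumptions made for illustration. The planner is assumed to have already decomposed the query into sub-tasks, and the agents are replaced by simple set lookups.

```python
from collections import defaultdict

# Toy stand-in for a "Structured Semantic Library": video id -> attribute -> values.
# All ids, attribute names, and values here are hypothetical.
LIBRARY = {
    "v1": {"object": {"dog"}, "action": {"running"}, "scene": {"beach"}},
    "v2": {"object": {"dog"}, "action": {"sleeping"}, "scene": {"sofa"}},
    "v3": {"object": {"cat"}, "action": {"running"}, "scene": {"beach"}},
    "v4": {"object": {"dog"}, "action": {"running"}},  # scene not annotated
}


def build_index(library):
    """Build an inverted index: (attribute, value) -> set of video ids."""
    index = defaultdict(set)
    for vid, attrs in library.items():
        for attr, values in attrs.items():
            for value in values:
                index[(attr, value)].add(vid)
    return index


def nominate(index, subtask):
    """One 'agent' nominates every video whose index entry matches its sub-task."""
    return set(index.get(subtask, set()))


def vetoes(library, vid, subtask):
    """Veto only on an explicit contradiction: the attribute is annotated
    for this video, but the requested value is not among its values."""
    attr, value = subtask
    values = library[vid].get(attr)
    return values is not None and value not in values


def retrieve(library, subtasks):
    """Return (agreed, controversial) candidate sets for pre-decomposed sub-tasks."""
    index = build_index(library)
    nominations = {st: nominate(index, st) for st in subtasks}
    pool = set().union(*nominations.values())
    # Strict veto: a single contradiction from any sub-task removes the candidate.
    survivors = {
        vid for vid in pool
        if not any(vetoes(library, vid, st) for st in subtasks)
    }
    # Candidates nominated for every sub-task are agreed; the rest survived
    # the veto round without full support and would need closer verification.
    agreed = {vid for vid in survivors if all(vid in nominations[st] for st in subtasks)}
    controversial = survivors - agreed
    return agreed, controversial


if __name__ == "__main__":
    query = [("object", "dog"), ("action", "running"), ("scene", "beach")]
    agreed, controversial = retrieve(LIBRARY, query)
    print(agreed, controversial)  # {'v1'} {'v4'}
```

In this toy run, `v2` and `v3` are vetoed because one annotated attribute contradicts the query, while `v4` survives with a missing scene annotation and lands in the controversial set. Only index entries that match a sub-task are touched during nomination, which is the sense in which such a workflow avoids scoring every video in the library.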