Cambridge, Cisco, Institute of Medical Technology, PKU, Queen's, Southwest Jiaotong University, Teesside University, University of Surrey
Jun 4, 2026
arXiv:2606.05931

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales

AI Summary

This paper introduces a query-adaptive framework for audio-visual person retrieval that detects which modalities are active for a target in real-world video archives. Using cross-modal score consistency, the system identifies whether the target's voice, face, or both are present, and so avoids the noise introduced by fusing scores from an absent modality. The method achieves 94.2% precision at rank one (P@1) on the BBC Rewind corpus, outperforming speaker-only (82.9%), face-only (93.4%), and fixed-fusion (90.0%) systems, and recovers 64% of the gap to an oracle system with ground-truth modality labels (96.6%).

Key Contribution

Query-adaptive detection of active modalities raises P@1 to 94.2% on a real-world video archive, 4.2 points above fixed fusion (90.0%) and 11.3 points above the speaker-only system (82.9%).

Abstract

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
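The cross-modal consistency idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract states that trained classifiers consume cross-modal features, whereas the sketch below substitutes a fixed threshold. The function names, the top-`k` size, the `threshold` value, the z-normalisation, the equal-weight fusion, and the peak-score rule for choosing a single modality are all assumptions made for illustration.

```python
import numpy as np

def zscore(x):
    """Normalise one query's scores over the archive (assumed preprocessing)."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

def cross_modal_consistency(spk, face, k=5):
    """Return two features: the mean normalised face score of the top-k
    speaker hits, and the mean normalised speaker score of the top-k face hits."""
    zs, zf = zscore(spk), zscore(face)
    top_spk = np.argsort(zs)[-k:]   # files ranked highest by voice
    top_face = np.argsort(zf)[-k:]  # files ranked highest by face
    return zf[top_spk].mean(), zs[top_face].mean()

def adaptive_scores(spk, face, k=5, threshold=1.0):
    """Pick a retrieval mode per query and return (mode, scores to rank by).

    Hypothetical decision rule: fuse only when both consistency features
    clear the threshold; otherwise fall back to the modality whose best
    normalised score is larger.
    """
    zs, zf = zscore(spk), zscore(face)
    spk_to_face, face_to_spk = cross_modal_consistency(spk, face, k)
    if min(spk_to_face, face_to_spk) >= threshold:
        return "both", (zs + zf) / 2  # equal-weight score fusion
    if zs.max() >= zf.max():
        return "speaker", zs
    return "face", zf
```

Here `spk` and `face` are per-file similarity scores for one query against the same list of archive files. When the target is both seen and heard, the files that rank highest by voice also score well by face, so both features are large and fusion is used; when one modality is absent, its scores are unrelated to the other's ranking, the features drop, and the sketch ranks by a single modality instead.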

Multimodal Models, Recommendation & Information Retrieval, Speech & Audio

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
