This paper introduces a query-adaptive framework for audio-visual person retrieval that detects which modalities are active for each query in real-world video archives. Using cross-modal score consistency, the system identifies whether the target's voice, face, or both are present, avoiding the noise introduced by fusing an absent modality. The method achieves 94.2% precision at rank one (P@1) on the BBC Rewind corpus, outperforming unimodal and fixed-fusion baselines and narrowing the gap to an oracle system with known modality labels.
Query-adaptive detection of active modalities raises P@1 by 4.2 points over fixed fusion (94.2% vs. 90.0%) and by 11.3 points over speaker-only retrieval (82.9%) in real-world video archives.
When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below that of the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (over 12,000 broadcast videos), the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), and recovering 64% of the gap between fixed fusion and an oracle with ground-truth modality labels (96.6%).
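The cross-modal consistency idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract describes trained classifiers, whereas the sketch below substitutes a fixed threshold. The function names, `k`, `threshold`, the z-score fusion, and the rule of falling back to the modality with the higher raw peak score (which assumes the two score scales are comparable) are all assumptions made for illustration.

```python
from statistics import mean


def zscore(scores):
    """Standardise scores to zero mean and unit variance."""
    mu = mean(scores)
    sd = mean([(s - mu) ** 2 for s in scores]) ** 0.5 or 1.0
    return [(s - mu) / sd for s in scores]


def top_k(scores, k):
    """Indices of the k highest-scoring files."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def consistency(src, dst, k):
    """Mean standardised dst score of the top-k files ranked by src.

    High when files retrieved by one modality also score well on the other.
    """
    z = zscore(dst)
    return mean([z[i] for i in top_k(src, k)])


def adaptive_scores(voice, face, k=2, threshold=0.5):
    """Return (label, scores), where label is 'both', 'voice' or 'face'.

    Hypothetical decision rule: fuse only if both directions agree.
    """
    v2f = consistency(voice, face, k)
    f2v = consistency(face, voice, k)
    if min(v2f, f2v) >= threshold:
        # Both modalities look active: fuse standardised scores.
        fused = [a + b for a, b in zip(zscore(voice), zscore(face))]
        return "both", fused
    # Agreement broke down: keep one modality only. Choosing the one with
    # the higher raw peak is an assumption, not taken from the paper.
    if max(voice) >= max(face):
        return "voice", list(voice)
    return "face", list(face)


# Toy per-file similarity scores for six archive files (made-up values).
voice = [0.90, 0.80, 0.10, 0.20, 0.10, 0.15]
face = [0.85, 0.90, 0.20, 0.10, 0.15, 0.10]
flat_face = [0.20, 0.10, 0.30, 0.15, 0.25, 0.20]  # target heard but unseen

print(adaptive_scores(voice, face)[0])
print(adaptive_scores(voice, flat_face)[0])
```

In the first call the top voice hits are also the top face hits, so the scores are fused; in the second the face scores carry no signal for the target, agreement breaks down, and the sketch ranks by voice alone instead of injecting face noise.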