This paper introduces EgoAdapt, a novel framework for "Talking to Me" (TTM) speaker detection in egocentric videos that addresses missing modalities, noisy audio, and the underuse of head orientation cues. EgoAdapt uses a Visual Speaker Target Recognition (VSTR) module to capture head orientation and lip movement, a Parallel Shared-weight Audio (PSA) encoder for robust audio feature extraction, and a Visual Modality Missing Awareness (VMMA) module to dynamically adjust for missing visual data. Experiments on the Ego4D dataset show that EgoAdapt outperforms state-of-the-art methods, achieving 67.39% mAP and 62.01% accuracy.
Even when visual data is missing or noisy, EgoAdapt accurately determines who is talking to the camera wearer by adaptively integrating head orientation, lip movement, and robust audio features.
The Talking to Me (TTM) task is a pivotal component of understanding human social interactions, aiming to determine who is engaged in conversation with the camera wearer. Traditional models often struggle in real-world scenarios due to missing visual data, background noise, and the neglect of head orientation. This study addresses these limitations by introducing EgoAdapt, an adaptive framework designed for robust egocentric "Talking to Me" speaker detection under missing modalities. Specifically, EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, allowing a comprehensive interpretation of both verbal and non-verbal signals for TTM and setting it apart from tasks focused solely on detecting speaking status; (2) a Parallel Shared-weight Audio (PSA) encoder for enhanced audio feature extraction in noisy environments; and (3) a Visual Modality Missing Awareness (VMMA) module that estimates the presence or absence of each modality at each frame and adjusts the system response dynamically. Comprehensive evaluations on the TTM benchmark of the Ego4D dataset demonstrate that EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%, significantly outperforming the state-of-the-art method by 4.96% in Accuracy and 1.56% in mAP.
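The abstract describes the architecture only at a high level, so the following is a minimal PyTorch sketch of how the three named modules might compose. Everything beyond the module acronyms is an assumption for illustration: the feature dimensions, the linear encoders inside VSTR, the shared GRU inside the PSA encoder, the sigmoid gate inside VMMA, and the presence-weighted pooling are stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VSTR(nn.Module):
    """Visual Speaker Target Recognition (sketch): encodes head orientation
    (non-verbal cue) and lip movement (verbal cue), then fuses them.
    Input dimensions are assumed for illustration."""
    def __init__(self, head_dim=6, lip_dim=128, dim=256):
        super().__init__()
        self.head_enc = nn.Sequential(nn.Linear(head_dim, dim), nn.ReLU())
        self.lip_enc = nn.Sequential(nn.Linear(lip_dim, dim), nn.ReLU())
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, head_pose, lip_feat):  # (B, T, head_dim), (B, T, lip_dim)
        fused = torch.cat([self.head_enc(head_pose), self.lip_enc(lip_feat)], -1)
        return self.fuse(fused)              # (B, T, dim)

class PSAEncoder(nn.Module):
    """Parallel Shared-weight Audio encoder (sketch): applies one shared
    recurrent encoder to several parallel views of the audio and averages
    the resulting embeddings for noise robustness."""
    def __init__(self, in_dim=80, dim=256):
        super().__init__()
        self.shared = nn.GRU(in_dim, dim, batch_first=True)

    def forward(self, audio_views):          # (B, V, T, in_dim)
        B, V, T, F = audio_views.shape
        _, h = self.shared(audio_views.reshape(B * V, T, F))  # same weights per view
        return h[-1].reshape(B, V, -1).mean(dim=1)            # (B, dim)

class VMMA(nn.Module):
    """Visual Modality Missing Awareness (sketch): predicts a per-frame
    presence score for the visual stream and gates it accordingly."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, vis_feat, vis_mask):   # (B, T, dim), (B, T) with 1 = visible
        presence = self.gate(vis_feat).squeeze(-1) * vis_mask  # learned x observed
        return vis_feat * presence.unsqueeze(-1), presence

class EgoAdaptSketch(nn.Module):
    """End-to-end sketch: gated visual features are pooled over time and
    concatenated with the audio embedding for binary TTM classification."""
    def __init__(self, dim=256):
        super().__init__()
        self.vstr, self.psa, self.vmma = VSTR(dim=dim), PSAEncoder(dim=dim), VMMA(dim)
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, head_pose, lip_feat, vis_mask, audio_views):
        vis, presence = self.vmma(self.vstr(head_pose, lip_feat), vis_mask)
        denom = presence.sum(dim=1, keepdim=True).clamp(min=1e-6)
        vis_pooled = vis.sum(dim=1) / denom  # presence-weighted temporal mean
        aud = self.psa(audio_views)
        return self.head(torch.cat([vis_pooled, aud], dim=-1))  # TTM logit

if __name__ == "__main__":
    model = EgoAdaptSketch()
    logit = model(torch.randn(2, 30, 6), torch.randn(2, 30, 128),
                  torch.ones(2, 30), torch.randn(2, 2, 100, 80))
    print(logit.shape)  # torch.Size([2, 1])
```

The point the sketch tries to capture is the VMMA gate: when the visual stream is absent or unreliable at a frame, its presence score shrinks toward zero, so the temporal pooling leans on whatever visual evidence remains while the shared-weight audio pathway keeps contributing.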