Apr 2, 2026 | arXiv:2604.01966

Ego-Grounding for Personalized Question-Answering in Egocentric Videos

Jun Xiao, Junbin Xiao, Sheng Zhang, Shenglang Zhang, Pengxiang Zhu, Peng Zhu, Angela Yao

AI Summary

This paper introduces MyEgo, a new egocentric VideoQA dataset designed to evaluate multimodal LLMs' ability to understand, remember, and reason about the camera wearer ("ego-grounding"). Benchmarking experiments on MyEgo reveal that state-of-the-art MLLMs, including GPT-5 and Qwen3-VL, struggle with personalized question answering, achieving only ~46% and ~36% accuracy, respectively. The study also finds that neither explicit reasoning nor model scaling consistently improves performance, and that models struggle to track and remember information about "me" and "my past" over time.

Key Contribution

Even the best multimodal LLMs are surprisingly bad at understanding and remembering the "self" in egocentric video, lagging human performance by 40-50% on personalized question answering.

Abstract

We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and ~36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo

Computer Vision | Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 56

Year: 2026

Venue: N/A
