Ohio State · May 30, 2026 · arXiv:2606.00825

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, James Fort, Richard A. Newcombe, Hyounghun Kim, Mi Zhang

AI Summary

This paper introduces SuperMemory-VQA, a novel benchmark for egocentric visual question answering that addresses the limitations of existing datasets by focusing on long-horizon memory tasks relevant to real-world applications. The dataset comprises 52.9 hours of recorded activities, featuring diverse modalities such as video, audio, and gaze data, and includes 4,853 human-verified question-answer pairs designed to evaluate memory capabilities in various contexts. Benchmarking current AI systems reveals significant shortcomings in their ability to handle realistic memory tasks, underscoring the necessity for new architectures that can effectively manage grounded memory retrieval.

Key Contribution

Existing AI systems struggle with real-world memory tasks, revealing a critical gap in their ability to serve as effective personal memory assistants.

Abstract

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.
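The multiple-choice format with an explicit "unanswerable" option suggests reporting accuracy separately on answerable and unanswerable questions, since a system that always guesses a concrete option can look acceptable overall while failing the hallucination test. The following is a minimal sketch of such a scorer, not the paper's evaluation code: the `MemoryQuestion` fields, the `UNANSWERABLE` label string, the `score` function, and the toy items are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical label for the "unanswerable" choice; the dataset's actual
# encoding of this option is not specified in the abstract.
UNANSWERABLE = "unanswerable"


@dataclass
class MemoryQuestion:
    question: str
    options: list[str]   # candidate answers, including UNANSWERABLE
    answer: str          # gold option
    category: str        # e.g. "object and location memory"


def score(items, predictions):
    """Return accuracy overall and on the answerable / unanswerable subsets."""
    counts = {"overall": [0, 0], "answerable": [0, 0], "unanswerable": [0, 0]}
    for item, pred in zip(items, predictions):
        correct = int(pred == item.answer)
        subset = "unanswerable" if item.answer == UNANSWERABLE else "answerable"
        for key in ("overall", subset):
            counts[key][0] += correct
            counts[key][1] += 1
    # None marks an empty subset instead of dividing by zero.
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}


# Toy items, invented for this example only.
items = [
    MemoryQuestion("Where did I leave my keys?",
                   ["kitchen counter", "desk", UNANSWERABLE],
                   "kitchen counter", "object and location memory"),
    MemoryQuestion("What color was the mug I bought?",
                   ["red", "blue", UNANSWERABLE],
                   UNANSWERABLE, "visual scene recall"),
    MemoryQuestion("What color was the door I walked through?",
                   ["blue", "green", UNANSWERABLE],
                   "blue", "visual scene recall"),
]
# The second prediction is a hallucinated answer to an unanswerable question.
predictions = ["kitchen counter", "red", "blue"]
result = score(items, predictions)
print(result)
```

In this toy run the answerable subset is fully correct while the unanswerable subset is fully wrong, which is the failure pattern a single overall accuracy figure would partly hide.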

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 58

Year: 2026

Venue: N/A
