The paper introduces EgoEverything, a new benchmark for long-context egocentric video understanding in AR environments. The key innovation is the use of human attention signals, derived from gaze data, to generate more behaviorally relevant question-answer pairs. The benchmark comprises over 5,000 multiple-choice questions across 100+ hours of video, offering a more realistic evaluation setting than existing datasets.
Current egocentric video benchmarks miss the mark: EgoEverything uses human gaze to create questions that actually reflect how people behave, not just what they see.
Long-context egocentric video understanding has recently attracted significant research attention, with augmented reality (AR) highlighted as one of its most important application domains. Nevertheless, the task remains highly challenging because it requires reasoning over extended temporal contexts and diverse, unstructured activities. Although several benchmarks exist, most egocentric datasets rely on human-worn cameras and focus mainly on visual content, with limited consideration of underlying user behavior when forming video-related queries. EgoEverything is a benchmark that explicitly considers human behavior by leveraging human attention signals, abstracted from gaze data, when generating questions. It comprises over 5,000 multiple-choice question-answer pairs spanning more than 100 hours of video. By integrating human attention signals during question generation, it more faithfully captures natural human behavior and offers a realistic evaluation setting for long-context egocentric video understanding in AR.