Mar 10, 2026 · arXiv:2603.09731

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha

AI Summary

The paper introduces EXPLORE-Bench, a new benchmark for evaluating the long-horizon reasoning capabilities of Multimodal Large Language Models (MLLMs) in egocentric scene prediction. The benchmark uses real first-person videos paired with structured final-scene annotations to assess a model's ability to predict the final scene after a sequence of actions. Experiments on various MLLMs demonstrate a significant performance gap compared to humans, highlighting the challenge of long-horizon egocentric reasoning.

Key Contribution

MLLMs still struggle to reliably predict the long-term consequences of actions from an egocentric viewpoint; the benchmark's structured final-scene annotations make this gap measurable at the level of objects, attributes, and relations.

Abstract

Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.
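As an illustration of how structured final-scene annotations can support fine-grained scoring, and how a long action sequence might be decomposed for stepwise reasoning, here is a minimal Python sketch. It is not the benchmark's actual schema, metric, or prompting pipeline: the `Scene` class, the set-based F1 score, and the `chunk_actions` helper are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    """Hypothetical structured final-scene annotation (not the paper's schema)."""
    objects: frozenset = frozenset()     # object categories, e.g. "cup"
    attributes: frozenset = frozenset()  # (object, attribute) pairs
    relations: frozenset = frozenset()   # (subject, relation, object) triples


def set_f1(predicted, reference):
    """F1 overlap between two sets; defined as 1.0 when both are empty."""
    if not predicted and not reference:
        return 1.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def score_scene(predicted, reference):
    """Score each annotation type separately (an assumed metric)."""
    return {
        "objects": set_f1(predicted.objects, reference.objects),
        "attributes": set_f1(predicted.attributes, reference.attributes),
        "relations": set_f1(predicted.relations, reference.relations),
    }


def chunk_actions(actions, chunk_size):
    """Split a long action sequence into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [actions[i:i + chunk_size] for i in range(0, len(actions), chunk_size)]


# Toy example: the prediction misses the kettle and gets the cup's state wrong.
reference = Scene(
    objects=frozenset({"cup", "table", "kettle"}),
    attributes=frozenset({("cup", "full")}),
    relations=frozenset({("cup", "on", "table")}),
)
predicted = Scene(
    objects=frozenset({"cup", "table"}),
    attributes=frozenset({("cup", "empty")}),
    relations=frozenset({("cup", "on", "table")}),
)
scores = score_scene(predicted, reference)
chunks = chunk_actions(["pick up kettle", "pour water", "put down kettle",
                        "pick up cup", "move cup to table"], 2)
```

In this toy case the relation score is perfect, the attribute score is zero, and the object score falls in between, which is the kind of per-category breakdown that structured annotations allow. Each chunk returned by `chunk_actions` would correspond to a separate model call under a stepwise scheme, which is where the extra computational overhead noted in the abstract would come from.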

Eval Frameworks & Benchmarks · Multimodal Models · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
