Mar 4, 2026, arXiv:2603.04349

FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering

Tatiana Zemskova, Solomon Andryushenko, Ilya Obrubov, Viktoriia Khoruzhaia, Ekaterina Eroshenko, Ekaterina Derevyanka, Dmitry Yudin

AI Summary

FocusGraph addresses the challenge of long video question answering by introducing a two-stage keyframe selection process: first, a Scene-Caption LLM Selector identifies relevant clips based on graph-based scene descriptions, and then a Patch-wise Sparse-Flow Retention (PSFR) method selects keyframes from these clips. This approach avoids directly processing the entire video frame sequence, mitigating performance degradation and inference time increases associated with long video inputs to MLLMs. Experiments on FindingDory and HourVideo datasets demonstrate that FocusGraph achieves state-of-the-art results while significantly reducing inference time compared to baselines.

Key Contribution

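MLLM answer quality degrades and inference slows as more frames are supplied. FocusGraph selects query-relevant clips from graph-based scene captions and then picks keyframes within them, which improves question answering on long videos while reducing inference time.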

Abstract

The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Recently, multimodal LLMs have been gaining popularity for solving the long video understanding task due to their general ability to understand natural language and to leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade, and inference time grows. Therefore, when using MLLMs for long video understanding, a crucial step is selecting key frames from the video to answer user queries. In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It leverages a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips based on their graph-based captions, and a training-free method for selecting keyframes from these clips. Unlike existing methods, the proposed Scene-Caption LLM Selector does not rely on the original sequence of low-resolution frames; instead, it operates on a compact textual representation of the scene. We then design a training-free Patch-wise Sparse-Flow Retention (PSFR) method to select keyframes from the resulting sequence of clips, which are fed into an MLLM to produce the final answer. Together, these components enable FocusGraph to achieve state-of-the-art results on challenging egocentric long-video question answering benchmarks, including FindingDory and HourVideo, while significantly reducing inference time relative to baseline approaches.
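The two-stage flow described in the abstract (select query-relevant clips from their captions, then pick keyframes inside those clips) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Clip` structure and all function names are hypothetical, a word-overlap score stands in for the trainable Scene-Caption LLM Selector, and evenly spaced sampling stands in for PSFR, whose details the abstract does not give.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Clip:
    """A video clip with a textual scene caption (hypothetical structure)."""
    clip_id: int            # temporal position of the clip in the video
    caption: str            # stand-in for a graph-based scene caption
    frame_ids: List[int]    # indices of the frames that belong to this clip


def overlap_score(query: str, caption: str) -> float:
    """Toy relevance score: fraction of query words found in the caption.

    Assumption: this replaces the trainable Scene-Caption LLM Selector.
    """
    query_words = set(query.lower().split())
    caption_words = set(caption.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & caption_words) / len(query_words)


def select_clips(
    query: str,
    clips: Sequence[Clip],
    top_k: int,
    score: Callable[[str, str], float] = overlap_score,
) -> List[Clip]:
    """Stage 1: keep the top_k clips by caption relevance, in temporal order."""
    ranked = sorted(clips, key=lambda c: score(query, c.caption), reverse=True)
    return sorted(ranked[:top_k], key=lambda c: c.clip_id)


def uniform_keyframes(frame_ids: Sequence[int], per_clip: int) -> List[int]:
    """Stage 2 placeholder: evenly spaced frames. This is NOT PSFR."""
    if per_clip <= 0 or not frame_ids:
        return []
    if per_clip >= len(frame_ids):
        return list(frame_ids)
    step = len(frame_ids) / per_clip
    return [frame_ids[int(i * step)] for i in range(per_clip)]


def select_keyframes(
    query: str, clips: Sequence[Clip], top_k: int = 2, per_clip: int = 2
) -> List[int]:
    """Run both stages and return the frame indices to pass to an MLLM."""
    keyframes: List[int] = []
    for clip in select_clips(query, clips, top_k):
        keyframes.extend(uniform_keyframes(clip.frame_ids, per_clip))
    return keyframes
```

The point of the sketch is the shape of the pipeline: the first stage only reads text captions, so the full frame sequence is never scored directly, and the second stage only touches frames from the clips that survived the first.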

Eval Frameworks & Benchmarks, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
