KAIST | Jun 10, 2026 | arXiv:2606.12300

Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

AI Summary

This paper introduces ExtremeWhenBench, which the authors describe as the first open benchmark for natural-language temporal grounding in hour-long videos (2,273 queries over 194 videos), and argues that the main challenge at this scale is search rather than recognition. Every open Video-LLM evaluated collapses, while a frame-level retrieval baseline outperforms them; the authors' failure taxonomy attributes 85% of failures to search. A hybrid retrieve-then-ground approach recovers 6.7x over the monolithic Video-LLM, pointing to retrieval-first pipelines as a direction for long-form temporal grounding.

Key Contribution

Video-LLMs fail to ground queries effectively in hour-long videos; a failure taxonomy attributes 85% of their failures to search rather than recognition.

Abstract

Temporal grounding--returning the interval $[t_s, t_e]$ for a natural-language query over a video--is the language interface to long-form video, yet has been studied on short videos; the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour-scale, the binding constraint is search, not recognition: Video-LLMs are bottlenecked not by localizing a nearby event, but--given a natural-language query--by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos, mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM--mirroring retrieve-then-read in open-domain QA.
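The retrieve-then-ground idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the per-frame relevance scores, the fixed-length sliding window, and the names `retrieve_window`, `retrieve_then_ground`, and `ground_fn` are all assumptions made for illustration. The sketch first searches the whole video for the window with the highest total frame score, then hands only that window to a grounding function (standing in for a Video-LLM) that returns offsets relative to the window start.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]  # (t_s, t_e) in seconds


def retrieve_window(frame_scores: List[float], fps: float, window_s: float) -> Interval:
    """Search step (assumed design): return the fixed-length window, in
    seconds, whose frames have the highest total query-relevance score."""
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    n = len(frame_scores)
    w = max(1, min(n, round(window_s * fps)))  # window length in frames
    window_sum = sum(frame_scores[:w])
    best_sum, best_start = window_sum, 0
    for start in range(1, n - w + 1):
        # Slide the window one frame: add the new frame, drop the old one.
        window_sum += frame_scores[start + w - 1] - frame_scores[start - 1]
        if window_sum > best_sum:  # ties keep the earliest window
            best_sum, best_start = window_sum, start
    return best_start / fps, (best_start + w) / fps


def retrieve_then_ground(
    frame_scores: List[float],
    fps: float,
    window_s: float,
    ground_fn: Callable[[float, float], Interval],
) -> Interval:
    """Retrieve a candidate window, then localize inside it.

    `ground_fn` is a hypothetical stand-in for a short-clip grounding model:
    it receives the window bounds and returns (start, end) offsets in
    seconds relative to the window start.
    """
    win_start, win_end = retrieve_window(frame_scores, fps, window_s)
    rel_start, rel_end = ground_fn(win_start, win_end)
    # Clamp the prediction so it stays inside the retrieved window.
    t_s = max(win_start, min(win_end, win_start + rel_start))
    t_e = max(t_s, min(win_end, win_start + rel_end))
    return t_s, t_e


if __name__ == "__main__":
    # One-hour video at 1 fps; the relevant event spans seconds 1800-1830.
    scores = [0.0] * 3600
    for i in range(1800, 1830):
        scores[i] = 1.0
    # Stub grounder: pretend the model localizes the second half of the clip.
    print(retrieve_then_ground(scores, 1.0, 60.0, lambda s, e: (30.0, 60.0)))
    # prints (1800.0, 1830.0)
```

The point of the split is that the grounding model only ever sees a 60-second clip, so its job reduces to the short-video localization it was built for, while the hour-scale search is handled by the cheaper frame-level scores.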

Computer Vision | Natural Language Processing | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
