Jun 4, 2026, arXiv:2606.06294

Towards One-to-Many Temporal Grounding

Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li

AI Summary

This paper addresses One-to-Many Temporal Grounding (OMTG), where multiple disjoint video segments must be localized in response to a single textual query. The authors introduce an OMTG benchmark and a dataset of 56k samples, along with temporal and caption reward functions designed for the task. Their model achieves a new state-of-the-art Effective Temporal F1 (EtF1) of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.

Key Contribution

With an Effective Temporal F1 (EtF1) score of 43.65% on OMTG Bench, this work shows that MLLMs, which the authors report often score near zero when optimized only for one-to-one settings, can be adapted for One-to-Many Temporal Grounding.

Abstract

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
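
The abstract names Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) but does not define them here. The sketch below is therefore not the paper's formulation. It is a minimal illustration of how a one-to-many prediction might be scored, under two assumptions of ours: a count check that simply compares the number of predicted and annotated segments, and a segment-level F1 based on greedy one-to-one matching at a temporal IoU threshold of 0.5. The names `count_match`, `segment_f1`, and `thresh` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def iou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """Assumed count check: same number of predicted and annotated segments."""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment], thresh: float = 0.5) -> float:
    """Assumed segment-level F1: greedy one-to-one matching by descending IoU."""
    if not pred or not gt:
        return 0.0
    # All (IoU, pred index, gt index) pairs, best overlaps first.
    pairs = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_pos = 0
    for score, i, j in pairs:
        if score < thresh:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue  # each segment may be matched at most once
        used_pred.add(i)
        used_gt.add(j)
        true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


# One query, three annotated segments; the prediction finds two of them.
gt = [(0.0, 10.0), (20.0, 30.0), (50.0, 60.0)]
pred = [(1.0, 10.0), (21.0, 29.0)]
print(count_match(pred, gt), round(segment_f1(pred, gt), 3))
```

Under these assumptions, a single long span such as `[(0.0, 60.0)]` matches none of the three annotated segments at the 0.5 threshold and scores zero, which shows why a one-to-one style answer can fare poorly when several disjoint segments are expected.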

Multimodal Models, Recommendation & Information Retrieval

Citation Metrics

Citations: 0
Influential citations: 0
References: 54
Year: 2026
Venue: N/A
