Corresponding authors are Bo Cheng and Soujanya Poria

Apr 16, 2026

arXiv:2604.14692

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Zhixuan Wu, Quanxing Zha, Tengfei Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria

AI Summary

The paper introduces Chain-of-Glimpse, a framework for video understanding that grounds reasoning steps to specific visual regions, addressing limitations of object-agnostic methods. It uses a reinforcement learning-optimized search-guided controller to iteratively identify and ground task-relevant objects, forming reliable reasoning trajectories. Experiments on NExTQA, Video-Holmes, CG-Bench Reasoning, and VRBench show Chain-of-Glimpse achieves consistent performance gains and improved generalization across diverse video reasoning tasks.

Key Contribution

By explicitly grounding reasoning steps to visual objects, Chain-of-Glimpse enables more accurate and interpretable video understanding, outperforming object-agnostic methods on multiple benchmarks.

Abstract

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.
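To make the idea of a format reward concrete, the following is a minimal sketch of how such a reward might check that a reasoning trace is grounded. The abstract does not specify the output format or the reward definition, so everything here is an assumption for illustration: the `<glimpse>` and `<answer>` tag names, the `frame, x1, y1, x2, y2` box layout, the binary 0/1 scoring, and the function name `format_reward` are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical output format (NOT from the paper): each grounded step carries a
# <glimpse>frame, x1, y1, x2, y2</glimpse> tag with integer pixel coordinates,
# and the trace ends with exactly one <answer>...</answer> tag.
GLIMPSE_RE = re.compile(
    r"<glimpse>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</glimpse>"
)
ANSWER_RE = re.compile(r"<answer>(.+?)</answer>", re.DOTALL)


def format_reward(trace: str) -> float:
    """Return 1.0 if the trace is well-formed under the assumed format, else 0.0."""
    # Require exactly one final answer.
    if len(ANSWER_RE.findall(trace)) != 1:
        return 0.0
    # Require at least one parseable grounded region.
    boxes = GLIMPSE_RE.findall(trace)
    if not boxes:
        return 0.0
    # Every parsed box must have positive width and height.
    for _frame, x1, y1, x2, y2 in boxes:
        if int(x1) >= int(x2) or int(y1) >= int(y2):
            return 0.0
    return 1.0


if __name__ == "__main__":
    example = (
        "Step 1: the boy picks up the ball <glimpse>3, 10, 20, 50, 80</glimpse>\n"
        "<answer>B</answer>"
    )
    print(format_reward(example))
```

In a reinforcement learning setup, a term like this would typically be combined with an answer-correctness reward; how Chain-of-Glimpse weights or shapes its rewards is not stated in the abstract.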

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 58

Year: 2026

Venue: N/A
