This paper introduces Online Reasoning Video Object Segmentation (ORVOS), a more realistic setting for reasoning video object segmentation in which models must make frame-by-frame predictions using only past and current frames. To support research in this area, the authors present ORVOSB, a new benchmark with frame-level causal annotations and referent-shift labels. They also propose a baseline model that uses continually updated segmentation prompts and a temporal token reservoir. Their experiments show how difficult online reasoning is for existing methods, and the baseline provides a foundation for future work.
Existing reasoning video object segmentation methods struggle under strict causality and referent shifts, exposing a gap between offline research benchmarks and real-world deployment.
Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployments that require strictly causal, frame-by-frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories. We further propose a baseline with continually-updated segmentation prompts and a structured temporal token reservoir for long-horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while our baseline establishes a strong foundation for future research.
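To make the online setting concrete, the following is a minimal Python sketch of a strictly causal, frame-by-frame loop with a bounded token memory. It is not the authors' implementation: the abstract does not specify how the temporal token reservoir is structured, so the two-tier design here (a recent window plus a strided long-term store) is an assumption, and the names `TokenReservoir`, `run_online`, and `step` are hypothetical. Scalar floats stand in for frames and features.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple

Token = Tuple[int, float]  # (frame index, stand-in feature); hypothetical layout


class TokenReservoir:
    """Bounded memory: a recent window plus a sparsely sampled long-term store.

    Assumed design for illustration only, not the paper's reservoir.
    """

    def __init__(self, recent_size: int, long_term_size: int, stride: int) -> None:
        if recent_size < 1 or stride < 1:
            raise ValueError("recent_size and stride must be at least 1")
        self.recent: Deque[Token] = deque(maxlen=recent_size)
        self.long_term: Deque[Token] = deque(maxlen=long_term_size)
        self.stride = stride

    def add(self, token: Token) -> None:
        # If the recent window is full, its oldest token is about to be evicted.
        # Keep it in the long-term store when its frame index hits the stride.
        if len(self.recent) == self.recent.maxlen:
            evicted = self.recent[0]
            if evicted[0] % self.stride == 0:
                self.long_term.append(evicted)
        self.recent.append(token)

    def tokens(self) -> List[Token]:
        # Total size never exceeds recent_size + long_term_size.
        return list(self.long_term) + list(self.recent)


def run_online(
    frames: Iterable[float],
    step: Callable[[int, float, List[Token]], Tuple[bool, Token]],
    reservoir: TokenReservoir,
) -> Iterator[bool]:
    """Yield one prediction per frame using only past context.

    Each prediction is emitted before the next frame is read and is never
    revised afterwards, which mirrors the causal constraint of the task.
    """
    for t, frame in enumerate(frames):
        context = reservoir.tokens()       # tokens from frames before t only
        mask, token = step(t, frame, context)
        reservoir.add(token)               # memory is updated after predicting
        yield mask
```

In a real system, `step` would wrap a segmentation model that takes the query, the current frame, and the stored tokens, and the mask would be a pixel-level array rather than a boolean. The point of the sketch is the control flow: memory stays bounded regardless of video length, and no future frame is visible when a prediction is made.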