Google Research, UCSC | Mar 9, 2026 | arXiv:2603.08648

CAST: Modeling Visual State Transitions for Consistent Video Retrieval

Yanqing Liu, Yingcheng Liu, Fanghong Dong, Budianto Budianto, Cihang Xie, Yance Jiao, Yan Jiao

AI Summary

The paper introduces Consistent Video Retrieval (CVR), a new task and benchmark that addresses a limitation of context-agnostic video retrieval methods: they neglect state and identity consistency in long-form narratives. The authors propose CAST (Context-Aware State Transition), a plug-and-play adapter that predicts state-conditioned residual updates from visual history to model latent state evolution. Experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and provides a useful reranking signal for candidates produced by video generation models.

Key Contribution

CAST moves video retrieval beyond local semantic alignment by explicitly modeling visual state transitions, which yields more temporally coherent retrieval and a reranking signal for generated video candidates.

Abstract

As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($\Delta$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
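The residual formulation in the abstract can be illustrated with a minimal sketch. The abstract does not specify the adapter's architecture, so everything below is an assumption made for illustration: the class name `ResidualStateAdapter`, mean pooling over the history embeddings, the two-layer MLP, the zero-initialized output layer, and cosine-similarity ranking. The sketch only shows the general pattern of adding a predicted update $\Delta$ to a frozen embedding and then ranking candidate clips against the result; it is not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class ResidualStateAdapter:
    """Hypothetical adapter: prediction = base embedding + delta(history).

    The backbone embeddings are treated as frozen inputs; only the small
    MLP weights below would be trained.
    """

    def __init__(self, dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.02, size=(2 * dim, hidden))
        # Assumption: zero-init the output layer so that an untrained
        # adapter predicts delta = 0 and reproduces the frozen backbone.
        self.w2 = np.zeros((hidden, dim))

    def delta(self, base, history):
        """Predict a residual update from the base embedding and history."""
        state = history.mean(axis=0)  # assumed pooling of the visual history
        hidden = np.tanh(np.concatenate([base, state]) @ self.w1)
        return hidden @ self.w2

    def predict(self, base, history):
        """Return the normalized, state-updated embedding."""
        return l2_normalize(base + self.delta(base, history))


def rank_candidates(pred, candidates):
    """Rank candidate clip embeddings by cosine similarity to `pred`."""
    scores = l2_normalize(candidates) @ pred
    return np.argsort(-scores), scores
```

Under the same assumptions, `rank_candidates` could also order a pool of generated clips once they are embedded by the same frozen backbone, which is the sense in which such an adapter might act as a reranking signal for black-box generation candidates.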

Computer Vision, Multimodal Models, Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
