HeyGen ResearchApr 30, 2026arXiv:2604.27975

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Ce Chen, Yi Ren, Yuanming Li, Viktor Goriachko, V. Goriachko, Zhenhui Ye, Zujin Guo, Zhibin Hong, Mingming Gong

AI Summary

This paper introduces Shot Transition Detection (STD) to explicitly detect continuous temporal segments of transitions in videos, addressing the limitations of traditional Shot Boundary Detection (SBD). They propose TransVLM, a Vision-Language Model (VLM) framework that incorporates optical flow as a motion prior to enhance temporal awareness. Experiments show TransVLM outperforms existing methods, including traditional heuristics, spatiotemporal networks, and other VLMs, and has been deployed to production.

Key Contribution

Injecting optical flow into VLMs lets them spot subtle video transitions that other methods miss, opening the door to more robust video understanding.

Abstract

Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine to synthesize diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs. This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model). Project page: https://chence17.github.io/TransVLM/

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Related Papers