The paper introduces ShotVerse, a "Plan-then-Control" framework for text-driven multi-shot video generation that decouples the process into a VLM-based Planner and a Controller. ShotVerse leverages a newly constructed dataset, ShotVerse-Bench, which contains aligned (Caption, Trajectory, Video) triplets obtained through an automated multi-shot camera calibration pipeline. Experiments demonstrate that ShotVerse generates camera-accurate and cross-shot consistent multi-shot videos with superior cinematic aesthetics, effectively bridging the gap between textual control and manual plotting.
Forget wrestling with finicky text prompts or tedious manual camera paths: ShotVerse lets you generate cinematic multi-shot videos from text, thanks to its clever "Plan-then-Control" framework and a dataset of aligned camera trajectories.
Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant bottleneck. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
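To make the two-agent decoupling concrete, here is a minimal Python sketch of the "Plan-then-Control" loop described in the abstract. The `Planner`, `Controller`, and `Shot` names and their methods are hypothetical stand-ins invented for illustration, not the paper's released API: the real Planner is a VLM reasoning over spatial priors, and the real Controller is a trajectory-conditioned video model with a camera adapter.

```python
# Hypothetical sketch of the "Plan-then-Control" decoupling; all names
# below are placeholders, not the authors' published interface.
from dataclasses import dataclass

@dataclass
class Shot:
    caption: str                     # per-shot text description
    trajectory: list[list[float]]    # camera poses in the shared global frame

class Planner:
    """Stand-in for the VLM-based Planner: text -> globally aligned shot plans."""
    def plan(self, script: str) -> list[Shot]:
        # Toy output: a single static shot. A real planner would reason
        # about cinematic spatial priors to produce multi-shot trajectories.
        return [Shot(caption=script, trajectory=[[0.0, 0.0, 0.0]])]

class Controller:
    """Stand-in for the trajectory-conditioned generator (camera adapter)."""
    def render(self, shot: Shot) -> str:
        return f"<video | {shot.caption} | {len(shot.trajectory)} poses>"

def generate(script: str) -> list[str]:
    planner, controller = Planner(), Controller()
    return [controller.render(s) for s in planner.plan(script)]

print(generate("A slow dolly-in on the protagonist"))
```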
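The calibration step (placing disjoint single-shot trajectories in one global coordinate system) can also be sketched. The paper does not spell out its pipeline here, so the toy version below uses an assumed continuity heuristic: each shot's first pose is anchored to the previous shot's last pose. This is for intuition only, not the authors' method.

```python
# Toy alignment of per-shot camera trajectories into one global frame.
# Assumption (not from the paper): consecutive shots are chained so that
# each shot starts where the previous one ended.
import numpy as np

def align_shots(shots: list[np.ndarray]) -> list[np.ndarray]:
    """Each shot is an (N, 4, 4) array of camera-to-world poses expressed
    in its own local frame. Returns poses rewritten in a single global frame."""
    aligned = []
    anchor = np.eye(4)  # global pose at which the next shot begins
    for shot in shots:
        # Transform mapping this shot's local frame into the global frame,
        # chosen so the shot's first pose lands exactly on the anchor.
        local_to_global = anchor @ np.linalg.inv(shot[0])
        global_shot = np.einsum("ij,njk->nik", local_to_global, shot)
        aligned.append(global_shot)
        anchor = global_shot[-1]  # the next shot continues from here
    return aligned

# Example: two single-pose shots, each starting at its own local identity.
shots = [np.eye(4)[None], np.eye(4)[None]]
print([s.shape for s in align_shots(shots)])
```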