NTU, SJTU, UESTC, UTokyo, ZJU | Jun 8, 2026 | arXiv:2606.09639

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Zhucun Xue, Qianyu Zhou, Xiangtai Li, Lizhuang Ma, Jiangning Zhang, Dacheng Tao

AI Summary

This paper introduces CineDance-1M, a large-scale Text-to-Audio-Video dataset designed to enhance multi-shot, long-form cinematic audio-video generation, addressing the limitations of existing open-source models due to inadequate training data. The dataset features an average length of 92.8 seconds and includes 24.2 continuous shots per video, with structured annotations achieved through a rigorous three-stage curation process that incorporates diverse sourcing, narrative parsing, and dual-modal captioning. The effectiveness of this dataset is validated through the adaptation of LTX-2.3, which showcases high-quality single-modality outputs and precise audio-video alignment, thereby establishing a benchmark for future research in this domain.

Key Contribution

CineDance-1M sets a new standard for open-source cinematic audio-video generation, boasting over 1 million high-quality, structured video samples that could transform the landscape of multimedia AI.

Abstract

The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.
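To make the reported averages concrete, the following is a minimal sketch of what a multi-shot sample with per-shot video and audio captions might look like, together with the mean shot length implied by the abstract's figures. The class and field names (`Shot`, `Sample`, `video_caption`, `audio_caption`, `global_caption`) are hypothetical and are not taken from the CineDance-1M release; the arithmetic assumes that the shots tile each video without gaps, which "continuous shots" suggests but the abstract does not state outright.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    """One shot in a multi-shot sample (hypothetical schema)."""
    start_s: float
    end_s: float
    video_caption: str  # assumed per-shot visual description
    audio_caption: str  # assumed per-shot audio description

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


@dataclass
class Sample:
    """A long-form sample made of consecutive shots (hypothetical schema)."""
    global_caption: str
    shots: List[Shot] = field(default_factory=list)

    @property
    def duration_s(self) -> float:
        # Assumes shots are contiguous, so the video length is the sum of shot lengths.
        return sum(shot.duration_s for shot in self.shots)

    def mean_shot_duration_s(self) -> float:
        return self.duration_s / len(self.shots)


def implied_mean_shot_duration(avg_video_s: float, avg_shots: float) -> float:
    """Pooled mean shot length: total seconds divided by total shots."""
    return avg_video_s / avg_shots


# Figures from the abstract: 92.8 s and 24.2 shots per video on average.
print(round(implied_mean_shot_duration(92.8, 24.2), 2))  # 3.83
```

Under these assumptions, a typical shot lasts a little under four seconds, so a model trained on such data has to handle a shot transition roughly every four seconds while keeping subjects and environments consistent.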

Data Curation & Synthetic Data, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
