Apr 16, 2026 | arXiv:2604.15127

MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

Huanran Hu, Zihui Ren, Dingyi Yang, Liangyu Chen, Qixiang Gao, Tiezheng Ge, Qin Jin

AI Summary

The paper introduces Multimodal Context-to-Script Creation (MCSC), a new task that requires transforming noisy multimodal inputs and user instructions into executable video scripts, encompassing material selection, narrative planning, and script generation. To facilitate research on MCSC, the authors present MCSC-Bench, a large-scale dataset of 11K+ videos with detailed annotations including redundant multimodal materials, user instructions, and production-ready scripts. Experiments demonstrate that current multimodal LLMs struggle with structure-aware reasoning in this context, but models trained on MCSC-Bench achieve state-of-the-art performance and generalize to out-of-domain scenarios.

Key Contribution

Training on MCSC-Bench allows an 8B model to outperform Gemini-2.5-Pro in generating video scripts from noisy multimodal inputs, highlighting the importance of targeted datasets for complex reasoning tasks.

Abstract

Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and lack support for evaluating this full process. To address this gap, we propose Multimodal Context-to-Script Creation (MCSC), a new task that transforms noisy multimodal inputs and user instructions into structured, executable video scripts. We further introduce MCSC-Bench, the first large-scale MCSC dataset, comprising 11K+ well-annotated videos. Each sample includes: (1) redundant multimodal materials and user instructions; (2) a coherent, production-ready script containing material-based shots, newly planned shots (with shooting instructions), and shot-aligned voiceovers. MCSC-Bench supports comprehensive evaluation across material selection, narrative planning, and conditioned script generation, and includes both in-domain and out-of-domain test sets. Experiments show that current multimodal LLMs struggle with structure-aware reasoning under long contexts, highlighting the challenges posed by our benchmark. Models trained on MCSC-Bench achieve SOTA performance, with an 8B model surpassing Gemini-2.5-Pro, and generalize to out-of-domain scenarios. Downstream video generation guided by the generated scripts further validates the practical value of MCSC. Datasets are available at: https://github.com/huanran-hu/MCSC.
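To make the sample structure described in the abstract concrete, here is a minimal Python sketch of one sample: a pool of redundant materials, a user instruction, and a script that mixes material-based shots with newly planned shots, each with a voiceover. All class names, field names, and example values are hypothetical illustrations; they are not the actual MCSC-Bench schema, which should be checked in the repository linked above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Shot:
    """One shot in a script (hypothetical schema, not the dataset's real format)."""
    voiceover: str                               # shot-aligned voiceover text
    material_id: Optional[str] = None            # set for material-based shots
    shooting_instruction: Optional[str] = None   # set for newly planned shots

    @property
    def is_planned(self) -> bool:
        # Assumption: a shot without a source material is a newly planned shot.
        return self.material_id is None


@dataclass
class Sample:
    """One sample: user instruction, candidate materials, and target script."""
    instruction: str
    material_ids: List[str]
    script: List[Shot] = field(default_factory=list)

    def selected_materials(self) -> List[str]:
        # Materials actually used by the script, in shot order.
        return [s.material_id for s in self.script if s.material_id is not None]

    def unused_materials(self) -> List[str]:
        # Redundant materials that the script leaves out.
        used = set(self.selected_materials())
        return [m for m in self.material_ids if m not in used]


# Made-up example values, for illustration only.
sample = Sample(
    instruction="Make a short promo for a travel mug.",
    material_ids=["clip_01", "clip_02", "clip_03"],
    script=[
        Shot(voiceover="Meet your new commute companion.", material_id="clip_02"),
        Shot(
            voiceover="Hot for hours.",
            shooting_instruction="Close-up of steam rising from the open mug.",
        ),
    ],
)
```

In this sketch, material selection corresponds to which entries of `material_ids` appear in the script, and narrative planning corresponds to the shots for which `is_planned` is true.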

Eval Frameworks & Benchmarks, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 55

Year: 2026

Venue: N/A
