This paper introduces the video-to-script (V2S) task, which aims to generate detailed, temporally grounded scripts from long-form cinematic videos, and constructs a corresponding human-annotated benchmark. To address this task, the authors propose OmniScript, an 8B-parameter audio-visual language model trained with a progressive pipeline: chain-of-thought supervised fine-tuning, followed by reinforcement learning with temporally segmented rewards. Experiments show that OmniScript outperforms larger open-source models and performs comparably to Gemini 3-Pro in temporal localization and multi-field semantic accuracy.
An 8B model can rival Gemini 3-Pro in generating detailed, temporally aware scripts from long-form video, suggesting that targeted training can offset brute-force scaling for narrative comprehension.
Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.
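The abstract reports results on temporal localization but does not say how it is scored. As a rough illustration only, the sketch below uses temporal intersection-over-union (IoU), a common way to compare predicted and reference time spans. The function names, the span representation, and the best-match averaging are assumptions made for this example; they are not taken from the paper's evaluation framework.

```python
from typing import List, Tuple

# Hypothetical representation: a scene span as (start_sec, end_sec).
Span = Tuple[float, float]


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two time spans, in [0, 1]."""
    # Overlap length; clamped at zero when the spans are disjoint.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_best_iou(predicted: List[Span], reference: List[Span]) -> float:
    """Average, over reference scenes, of the best IoU with any predicted scene.

    This matching rule is an assumption for illustration, not the paper's metric.
    """
    if not reference:
        return 0.0
    total = 0.0
    for ref in reference:
        # default=0.0 covers the case of no predictions at all.
        total += max((temporal_iou(ref, p) for p in predicted), default=0.0)
    return total / len(reference)


if __name__ == "__main__":
    predicted = [(0.0, 10.0), (10.0, 20.0)]
    reference = [(0.0, 10.0), (12.0, 20.0)]
    print(f"mean best IoU: {mean_best_iou(predicted, reference):.2f}")
```

A score of this kind rewards scene boundaries that land close to the annotated ones, independently of whether the script text for each scene is correct, which is why the abstract lists semantic accuracy as a separate measure.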