Feb 25, 2026 · arXiv:2602.21829

StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles

AI Summary

The paper introduces StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles using LCS matching to improve semantic alignment in visual storytelling models. The alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. With DeepSeek V3 as judge, Qwen Storyteller3 fine-tuned on StoryMovie achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment, and scores 48.5% versus 38.0% for Storyteller (trained without script grounding), indicating improved dialogue attribution.

Key Contribution

Stop AI models from hallucinating character interactions: StoryMovie aligns visual stories with movie scripts and subtitles, enabling more accurate dialogue attribution and relationship dynamics.

Abstract

Visual storytelling models that correctly ground entities in images may still hallucinate semantic relationships, generating incorrect dialogue attribution, character interactions, or emotional states. We introduce StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching. Our alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Using this aligned content, we generate stories that maintain visual grounding tags while incorporating authentic character names, dialogue, and relationship dynamics. We fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. Evaluation using DeepSeek V3 as judge shows that Storyteller3 achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment. Compared to Storyteller, trained without script grounding, Storyteller3 achieves 48.5% versus 38.0%, confirming that semantic alignment progressively improves dialogue attribution beyond visual grounding alone.
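The alignment step described above can be illustrated with a minimal sketch. It assumes that LCS refers to longest common subsequence, that matching is done at the level of whole dialogue lines, and that two lines count as equal when their normalized text is at least 80% similar. The abstract does not state the granularity or any threshold, so these choices, along with the names `ScriptLine`, `Subtitle`, `similar`, and `lcs_align`, are hypothetical and not the authors' implementation.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class ScriptLine:
    speaker: str  # character name taken from the screenplay
    text: str


@dataclass
class Subtitle:
    start: float  # seconds
    end: float
    text: str


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so formatting differences do not block a match."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy equality test (assumed threshold; not specified in the source)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def lcs_align(script, subs):
    """Return (speaker, subtitle text, start, end) for lines on the longest common subsequence."""
    n, m = len(script), len(subs)
    match = [[similar(s.text, t.text) for t in subs] for s in script]
    # dp[i][j] = LCS length between script[i:] and subs[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if match[i][j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover the matched pairs in order.
    aligned, i, j = [], 0, 0
    while i < n and j < m:
        if match[i][j]:
            aligned.append((script[i].speaker, subs[j].text, subs[j].start, subs[j].end))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # script line has no subtitle counterpart
        else:
            j += 1  # subtitle (e.g. a sound cue) has no script counterpart
    return aligned


# Made-up example data, not taken from the dataset.
script = [
    ScriptLine("ALICE", "Where were you last night?"),
    ScriptLine("BOB", "I was at the office."),
    ScriptLine("ALICE", "Don't lie to me."),
]
subs = [
    Subtitle(1.0, 2.5, "Where were you last night?"),
    Subtitle(3.0, 4.0, "[door slams]"),
    Subtitle(4.5, 6.0, "I was at the office"),
    Subtitle(6.5, 8.0, "Don't lie to me!"),
]
for speaker, text, start, end in lcs_align(script, subs):
    print(f"{start:5.1f}-{end:5.1f}  {speaker}: {text}")
```

Each matched pair carries the speaker name from the script and the timestamps from the subtitle, which is the link the pipeline relies on for dialogue attribution. Because a subsequence preserves order, unmatched subtitles such as sound cues are skipped without disturbing later matches.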

Data Curation & Synthetic Data, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
