The paper introduces StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles using longest common subsequence (LCS) matching to improve semantic alignment in visual storytelling models. The alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Qwen Storyteller3, fine-tuned on StoryMovie, achieves an 89.9% win rate against the base Qwen2.5-VL 7B on subtitle alignment and outperforms Storyteller, which was trained without script grounding (48.5% versus 38.0%), indicating improved dialogue attribution.
Stop AI models from hallucinating character interactions: StoryMovie aligns visual stories with movie scripts and subtitles, enabling more accurate dialogue attribution and relationship dynamics.
Visual storytelling models that correctly ground entities in images may still hallucinate semantic relationships, generating incorrect dialogue attribution, character interactions, or emotional states. We introduce StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching. Our alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Using this aligned content, we generate stories that maintain visual grounding tags while incorporating authentic character names, dialogue, and relationship dynamics. We fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. Evaluation using DeepSeek V3 as judge shows that Storyteller3 achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment. Compared to Storyteller, trained without script grounding, Storyteller3 achieves 48.5% versus 38.0%, confirming that semantic alignment progressively improves dialogue attribution beyond visual grounding alone.
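The alignment step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes word-level longest common subsequence (LCS) matching, a hypothetical acceptance threshold `min_ratio`, and illustrative types (`ScriptLine`, `Subtitle`) and function names that do not come from the paper. Each subtitle takes the character name of the script line it overlaps most, while keeping its own timestamps.

```python
import re
from dataclasses import dataclass


@dataclass
class ScriptLine:
    character: str  # speaker name from the screenplay
    text: str       # dialogue text from the screenplay


@dataclass
class Subtitle:
    start: float    # start time in seconds
    end: float      # end time in seconds
    text: str       # subtitle text (no speaker information)


def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def attribute_dialogue(script, subtitles, min_ratio=0.6):
    """Attach a speaker to each subtitle via word-level LCS overlap.

    min_ratio is a hypothetical threshold: the fraction of subtitle
    words that must appear, in order, in the matched script line.
    Subtitles below the threshold get speaker None.
    """
    attributed = []
    for sub in subtitles:
        sub_tok = tokens(sub.text)
        best_name, best_ratio = None, 0.0
        if sub_tok:
            for line in script:
                ratio = lcs_length(sub_tok, tokens(line.text)) / len(sub_tok)
                if ratio > best_ratio:
                    best_name, best_ratio = line.character, ratio
        speaker = best_name if best_ratio >= min_ratio else None
        attributed.append((sub.start, sub.end, speaker, sub.text))
    return attributed


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    script = [
        ScriptLine("ANNA", "Where were you last night?"),
        ScriptLine("BEN", "I was at the station."),
    ]
    subtitles = [
        Subtitle(1.0, 2.5, "Where were you last night"),
        Subtitle(3.0, 4.2, "I was at the station"),
        Subtitle(5.0, 6.0, "[door slams]"),
    ]
    for row in attribute_dialogue(script, subtitles):
        print(row)
```

This version compares every subtitle against every script line, which is quadratic in the number of lines. A pipeline at dataset scale would more plausibly run a single LCS pass over the two full sequences so that matches also respect scene order, but that is an assumption rather than something the abstract states.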