Microsoft Research | USTC | Jun 15, 2026 | arXiv:2606.16184

Closed-Loop Triplet Synergistic Generation for Long-Form Video

Xinlei Yin, Xiulian Peng, Xiao Li, Zhiwei Xiong, Yan Lu

AI Summary

This paper introduces CoTriSyGen, an agentic framework for multi-shot long-form video generation that addresses identity drift and compounding inconsistencies through a closed-loop synergy of generated visuals, text prompts, and persistent memory. A vision-language-model-based analyzer iteratively refines prompts and memory through intra-shot and inter-shot updates to keep shots coherent and continuous. Experiments on the authors' curated StoryBench benchmark report substantial improvements in cross-shot consistency, prompt adherence, and cinematic continuity over representative methods.

Key Contribution

CoTriSyGen improves long-range coherence in multi-shot video generation by feeding generated visual evidence back into an evolving entity-centric memory, which reduces identity drift across shots.

Abstract

Multi-shot long-form video generation remains challenging due to identity drift and compounding inconsistencies across shots. While storyboard-driven pipelines improve controllability, they are often executed in a feed-forward manner, with limited mechanisms to incorporate generated visual evidence back into subsequent conditioning. We propose CoTriSyGen, an agentic framework that formulates multi-shot long video generation as a closed-loop visual-text-memory synergy process, where planned intent, persistent memory, and generated visuals are jointly leveraged for iterative correction and long-range coherence. A vision-language-model-based analyzer reasons over this triplet and produces updates to both prompts and memory along two pathways: (i) intra-shot refinement, which triggers targeted regeneration when semantic or compositional violations are detected and refines the image-to-video prompt for coherent motion; and (ii) inter-shot refinement, which rewrites subsequent-shot prompts to propagate newly manifested entities or attributes and to improve prompt quality (e.g., compositional grounding and cinematic fluency) based on generated evidence. The loop is grounded in an entity-centric memory modeled as a mutable visual state that evolves as the story progresses. Both the generator and the analyzer continuously update this memory, adding new and evolved entities to reflect appearance changes, accumulated multi-view evidence, and multi-entity compositions. Experiments on our curated StoryBench benchmark demonstrate substantial improvements in cross-shot consistency, prompt adherence, and cinematic continuity over representative methods.
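
The closed loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (EntityMemory, Verdict, generate_story, toy_generate, toy_analyze, max_retries) is hypothetical, clips are represented as plain strings, and the analyzer is a stub standing in for a vision-language model. It only shows the control flow: generate a shot, analyze it against the prompt and memory, regenerate on a violation (intra-shot), update memory, then rewrite the next prompt (inter-shot).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class EntityMemory:
    """Hypothetical entity-centric memory: entity name -> visual references."""
    views: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, entity: str, reference: str) -> None:
        refs = self.views.setdefault(entity, [])
        if reference not in refs:
            refs.append(reference)


@dataclass
class Verdict:
    """Hypothetical analyzer output for one generated shot."""
    ok: bool                      # False if a violation was detected
    revised_prompt: str           # prompt to use if regeneration is needed
    new_entities: Dict[str, str]  # entity name -> visual reference


Generator = Callable[[str, EntityMemory], str]
Analyzer = Callable[[str, str, EntityMemory], Verdict]


def generate_story(
    shot_prompts: List[str],
    generate: Generator,
    analyze: Analyzer,
    max_retries: int = 2,  # assumed retry budget, not from the paper
) -> Tuple[List[str], EntityMemory]:
    memory = EntityMemory()
    prompts = list(shot_prompts)
    clips: List[str] = []
    for i in range(len(prompts)):
        prompt = prompts[i]
        clip = generate(prompt, memory)
        verdict = analyze(prompt, clip, memory)
        # Intra-shot refinement: regenerate while a violation is reported.
        retries = 0
        while not verdict.ok and retries < max_retries:
            prompt = verdict.revised_prompt
            clip = generate(prompt, memory)
            verdict = analyze(prompt, clip, memory)
            retries += 1
        clips.append(clip)
        # Memory update: record entities seen in the accepted clip.
        for entity, reference in verdict.new_entities.items():
            memory.add(entity, reference)
        # Inter-shot refinement: propagate entities into the next prompt.
        if i + 1 < len(prompts) and verdict.new_entities:
            names = ", ".join(sorted(verdict.new_entities))
            prompts[i + 1] = f"{prompts[i + 1]} [keep consistent: {names}]"
    return clips, memory


def toy_generate(prompt: str, memory: EntityMemory) -> str:
    """Stub generator: a 'clip' is just a string tagged with its prompt."""
    return f"clip({prompt})"


def toy_analyze(prompt: str, clip: str, memory: EntityMemory) -> Verdict:
    """Stub analyzer: flags a violation if the prompt omits the red scarf."""
    ok = "red scarf" in prompt
    revised = prompt if ok else f"{prompt}, wearing a red scarf"
    entities = {"Mia": clip} if "Mia" in prompt else {}
    return Verdict(ok=ok, revised_prompt=revised, new_entities=entities)
```

Calling generate_story(["Mia walks into a cafe", "Mia orders coffee"], toy_generate, toy_analyze) returns two clips and a memory holding one reference to "Mia" per accepted shot. In a real system the stubs would be replaced by a video generator and a vision-language-model analyzer, and the memory would hold images or embeddings rather than strings.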

Computer Vision | Multimodal Models | Tool Use & Agents | World Models & Planning


Year: 2026

Venue: N/A
