Yonsei | Apr 16, 2026 | arXiv:2604.14648

Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting

I. Jeon, Minhyeok Lee, Seunghoon Lee, Minseok Kang, Suhwan Cho, Sangyoun Lee

AI Summary

This paper introduces Seen-to-Scene, a video outpainting framework that combines flow-based propagation with generative diffusion models to expand video content beyond original frame boundaries. The method uses a flow completion network, pre-trained for video inpainting and fine-tuned for motion field reconstruction, along with reference-guided latent propagation to improve temporal coherence and propagation efficiency. Experiments show Seen-to-Scene achieves state-of-the-art results in temporal coherence and visual realism, outperforming existing methods without requiring input-specific adaptation.

Key Contribution

Achieve video outpainting with superior temporal coherence and visual realism by unifying propagation-based and generation-based paradigms.

Abstract

Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.
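To make the "keep the seen, generate the unseen" idea concrete, the following is a minimal sketch of flow-based propagation in general, not the authors' implementation. It assumes the motion field is already available for every pixel of the expanded canvas (in the paper this is the job of the flow completion network), and it works on raw pixel values rather than diffusion latents. The function name `propagate_with_flow` and the flow convention used here are hypothetical choices made for illustration.

```python
def propagate_with_flow(target, target_seen, source, source_seen, flow):
    """Fill unseen target pixels by backward warping from a source frame.

    Assumed convention (illustrative only): flow[y][x] = (dy, dx) maps
    target pixel (y, x) to source pixel (y + dy, x + dx).
    Returns (filled_frame, filled_mask). Pixels that are still False in
    filled_mask could not be propagated and are left for a generative model.
    """
    height, width = len(target), len(target[0])
    out = [row[:] for row in target]
    mask = [row[:] for row in target_seen]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                continue  # keep the seen: never overwrite observed pixels
            dy, dx = flow[y][x]
            sy, sx = y + round(dy), x + round(dx)  # nearest-neighbor lookup
            inside = 0 <= sy < height and 0 <= sx < width
            if inside and source_seen[sy][sx]:
                out[y][x] = source[sy][sx]  # propagate observed content
                mask[y][x] = True
    return out, mask


# Toy 1x4 canvas: the target frame observes columns 1 and 2 only.
target = [[0, 10, 20, 0]]
target_seen = [[False, True, True, False]]
# The source frame comes from a camera panned one pixel to the left, so it
# observes content that lies outside the target's original field of view.
source = [[0, 5, 10, 0]]
source_seen = [[False, True, True, False]]
flow = [[(0, 1)] * 4]  # every target pixel maps one column to the right

filled, filled_mask = propagate_with_flow(
    target, target_seen, source, source_seen, flow
)
print(filled, filled_mask)
```

In this toy case the left border pixel is recovered from the source frame, while the right border pixel has no observed correspondence and stays masked. That remaining region is what a generation-based stage would have to synthesize.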

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
