CMU ML | Adelaide University | Apr 9, 2026 | arXiv:2604.08500

Novel View Synthesis as Video Completion

Qi Wu, Khiem Vuong, Minsik Jeon, Srinivasa Narasimhan, Deva Ramanan

AI Summary

This paper introduces FrameCrafter, a novel approach to sparse novel view synthesis (NVS) that leverages video diffusion models by reformulating NVS as a low frame-rate video completion task. To handle the unordered nature of sparse NVS inputs, the authors propose architectural modifications to video models, including per-frame latent encodings and removal of temporal positional embeddings, effectively making the models permutation-invariant. Experiments demonstrate that video models can be adapted to NVS with minimal supervision, achieving competitive performance on sparse-view NVS benchmarks.

Key Contribution

Video diffusion models already contain implicit multi-view knowledge, making them surprisingly effective for novel view synthesis when adapted to ignore temporal coherence.

Abstract

We tackle the problem of sparse novel view synthesis (NVS) using video diffusion models; given $K$ ($\approx 5$) multi-view images of a scene and their camera poses, we predict the view from a target camera pose. Many prior approaches leverage generative image priors encoded via diffusion models. However, models trained on single images lack multi-view knowledge. We instead argue that video models already contain implicit multi-view knowledge and so should be easier to adapt for NVS. Our key insight is to formulate sparse NVS as a low frame-rate video completion task. However, one challenge is that sparse NVS is defined over an unordered set of inputs, often too sparse to admit a meaningful order, so the models should be $\textit{invariant}$ to permutations of that input set. To this end, we present FrameCrafter, which adapts video models (naturally trained with coherent frame orderings) to permutation-invariant NVS through several architectural modifications, including per-frame latent encodings and removal of temporal positional embeddings. Our results suggest that video models can be easily trained to "forget" about time with minimal supervision, producing competitive performance on sparse-view NVS benchmarks. Project page: https://frame-crafter.github.io/
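The permutation-invariance argument can be illustrated with a toy sketch. It is not FrameCrafter's implementation: it assumes each input view is reduced to a single token vector, uses attention with identity projections, and all names (`attend`, `target_output`, `pos_emb`) are hypothetical. It shows only that a target token attending over context tokens gives the same output for any ordering of the context when no temporal positional embedding is added, and a different output once index-dependent embeddings are added.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query vector over a set of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def target_output(context, target, pos_emb=None):
    """Output at the target token after attending over context tokens and itself.

    pos_emb (optional) holds one vector per context index, standing in for a
    temporal positional embedding (an assumption made for this toy example).
    """
    if pos_emb is not None:
        context = [[c + p for c, p in zip(tok, pos_emb[i])]
                   for i, tok in enumerate(context)]
    tokens = context + [target]
    return attend(target, tokens, tokens)

def close(a, b):
    """True if two vectors agree up to floating-point rounding."""
    return all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(a, b))

# Three toy "views" (one token each) and a target query token.
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shuffled = list(reversed(context))  # same set of views, different order
target = [1.0, 0.5]
pos_emb = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # depends on index only

# Without positional embeddings the ordering does not matter.
out_ordered = target_output(context, target)
out_shuffled = target_output(shuffled, target)
print(close(out_ordered, out_shuffled))          # True

# With index-dependent embeddings the ordering changes the result.
out_ordered_pos = target_output(context, target, pos_emb)
out_shuffled_pos = target_output(shuffled, target, pos_emb)
print(close(out_ordered_pos, out_shuffled_pos))  # False
```

In this sketch the softmax-weighted sum runs over a set of tokens, so reordering the context only reorders the terms of the sum. Adding a per-index embedding ties each token to its slot, which is why the second comparison fails.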

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 57

Year: 2026

Venue: N/A
