LMUMar 10, 2026arXiv:2603.09819

ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation

Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Yang Bai, Chi Zhang, Ziyuan Liu, Abhinav Valada

AI Summary

ConfCtrl, a novel video interpolation framework, tackles the challenge of novel view synthesis from two images with large viewpoint changes by enabling diffusion models to follow prescribed camera poses while completing unseen regions. It initializes the diffusion process with a confidence-weighted projected point cloud latent combined with noise and employs a Kalman-inspired predict-update mechanism to balance pose-driven predictions with noisy geometric observations. Experiments demonstrate that ConfCtrl generates geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions.

Key Contribution

By fusing confidence-weighted point cloud projections with a Kalman-inspired update mechanism, ConfCtrl enables diffusion models to generate geometrically consistent novel views from sparse inputs, even under significant viewpoint shifts.

Abstract

We address the challenge of novel view synthesis from only two input images under large viewpoint changes. Existing regression-based methods lack the capacity to reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning from camera poses. To address these issues, we propose ConfCtrl, a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions. ConfCtrl initializes the diffusion process by combining a confidence-weighted projected point cloud latent with noise as the conditioning input. It then applies a Kalman-inspired predict-update mechanism, treating the projected point cloud as a noisy measurement and using learned residual corrections to balance pose-driven predictions with noisy geometric observations. This allows the model to rely on reliable projections while down-weighting uncertain regions, yielding stable, geometry-aware generation. Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.

Computer Vision Multimodal Models

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation

Related Papers