ConfCtrl is a video interpolation framework for novel view synthesis from two images separated by a large viewpoint change. It enables diffusion models to follow prescribed camera poses while completing unseen regions. The diffusion process is initialized with a confidence-weighted projected point cloud latent combined with noise, and a Kalman-inspired predict-update mechanism balances pose-driven predictions against noisy geometric observations. Experiments show that ConfCtrl generates geometrically consistent and visually plausible novel views and reconstructs occluded regions.
By fusing confidence-weighted point cloud projections with a Kalman-inspired update mechanism, ConfCtrl enables diffusion models to generate geometrically consistent novel views from sparse inputs, even under significant viewpoint shifts.
We address the challenge of novel view synthesis from only two input images under large viewpoint changes. Existing regression-based methods lack the capacity to reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning from camera poses. To overcome these limitations, we propose ConfCtrl, a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions. ConfCtrl initializes the diffusion process with a confidence-weighted projected point cloud latent combined with noise, which serves as the conditioning input. It then applies a Kalman-inspired predict-update mechanism, treating the projected point cloud as a noisy measurement and using learned residual corrections to balance pose-driven predictions with noisy geometric observations. This allows the model to rely on reliable projections while down-weighting uncertain regions, yielding stable, geometry-aware generation. Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.
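The two ideas in the abstract, confidence-weighted initialization and a predict-update step that treats the projection as a noisy measurement, can be illustrated with a small sketch. This is not the ConfCtrl implementation: the abstract gives no equations, so the function names (init_latent, confidence_to_variance, predict_update), the linear blend with noise, and the mapping from confidence to measurement variance are all assumptions made for illustration. The sketch also uses the classical closed-form scalar Kalman gain, whereas the paper describes learned residual corrections.

```python
import random


def init_latent(projected, confidence, rng, noise_scale=1.0):
    """Blend a projected latent with Gaussian noise, element by element.

    Assumed form: confidence 1 keeps the projection, confidence 0 yields pure noise.
    """
    return [
        c * p + (1.0 - c) * rng.gauss(0.0, noise_scale)
        for p, c in zip(projected, confidence)
    ]


def confidence_to_variance(confidence, eps=1e-6):
    """Map confidence in [0, 1] to a measurement variance (hypothetical mapping).

    High confidence gives a small variance, low confidence a large one.
    """
    return [(1.0 - c + eps) / (c + eps) for c in confidence]


def predict_update(prediction, measurement, pred_var, meas_var):
    """Classical element-wise Kalman update.

    prediction  : pose-driven estimate (the "predict" side)
    measurement : projected point cloud value, treated as a noisy observation
    """
    updated, updated_var = [], []
    for x, z, p, r in zip(prediction, measurement, pred_var, meas_var):
        gain = p / (p + r)                 # trust in the measurement, in [0, 1]
        updated.append(x + gain * (z - x))  # move toward the measurement
        updated_var.append((1.0 - gain) * p)
    return updated, updated_var


# Usage: three latent elements with high, medium, and zero confidence.
rng = random.Random(0)
projected = [1.0, 1.0, 1.0]
confidence = [1.0, 0.5, 0.0]

latent = init_latent(projected, confidence, rng)
meas_var = confidence_to_variance(confidence)
estimate, estimate_var = predict_update(
    prediction=[0.0, 0.0, 0.0],
    measurement=projected,
    pred_var=[1.0, 1.0, 1.0],
    meas_var=meas_var,
)
```

In this sketch the fully confident element ends up almost exactly at the projected value, the zero-confidence element stays almost exactly at the pose-driven prediction, and the middle element lands halfway between them. That is the qualitative behavior the abstract describes: relying on reliable projections while down-weighting uncertain regions.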