Mar 16, 2026 | arXiv:2603.14965

GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis

Minjun Kang, Inkyu Shin, Taeyeop Lee, Myungchul Kim, In So Kweon, Kuk-Jin Yoon

AI Summary

GeoNVS is a novel view synthesis method that addresses geometric distortions and limited camera controllability in camera-controlled video diffusion models by introducing a Gaussian Splat Feature Adapter (GS-Adapter). The GS-Adapter lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with the diffusion features in feature space. Across 9 scenes and 18 settings, GeoNVS reports state-of-the-art performance, improving on SEVA and CameraCtrl by 11.3% and 14.9% respectively, with up to 2x lower translation error and 7x lower Chamfer Distance.
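The fusion step can be pictured with a small sketch. The summary does not say how the adaptive fusion is computed, so the code below assumes the simplest plausible form: a per-location convex blend between the diffusion feature and the rendered, geometry-constrained feature, weighted by a confidence value in [0, 1]. The function name `fuse_features` and the `confidence` input are hypothetical and are not taken from the paper.

```python
def fuse_features(diffusion_feat, rendered_feat, confidence):
    """Blend two feature maps location by location.

    diffusion_feat, rendered_feat: lists of equal-length feature vectors,
        one vector per spatial location.
    confidence: one weight per location; 1.0 trusts the rendered
        (geometry-constrained) feature, 0.0 keeps the diffusion feature.
    This weighting scheme is an assumption made for illustration only.
    """
    fused = []
    for d_vec, r_vec, w in zip(diffusion_feat, rendered_feat, confidence):
        w = min(max(w, 0.0), 1.0)  # clamp the weight into [0, 1]
        fused.append([w * r + (1.0 - w) * d for d, r in zip(d_vec, r_vec)])
    return fused


if __name__ == "__main__":
    diffusion = [[1.0, 2.0], [0.0, 0.0]]
    rendered = [[3.0, 4.0], [1.0, 1.0]]
    print(fuse_features(diffusion, rendered, [0.5, 1.0]))
```

In a blend like this, locations where the rendered geometry is unreliable (for example, regions unseen in the input views) would receive a low weight and fall back to the diffusion feature.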

Key Contribution

By adapting diffusion features in 3D Gaussian space, GeoNVS achieves state-of-the-art novel view synthesis with significantly improved geometric fidelity and camera control compared to existing video diffusion models.

Abstract

Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, with up to 2x reduction in translation error and 7x in Chamfer Distance.
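For readers unfamiliar with the geometry metric cited above, the following is a minimal sketch of a symmetric Chamfer Distance between two point sets. Definitions vary across papers (squared versus unsquared distances, sum versus mean), and the abstract does not state which variant is used; this version assumes the mean unsquared nearest-neighbor distance, summed over both directions. The function name `chamfer_distance` is illustrative.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two non-empty point sets.

    Each direction averages, over the source points, the Euclidean
    distance to the nearest point in the other set; the two directional
    terms are then added. Brute force, O(len(a) * len(b)).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0)]
    print(chamfer_distance(predicted, reference))
```

Lower values mean the reconstructed geometry lies closer to the reference; identical point sets give a distance of zero.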

Computer Vision, Multimodal Models, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
