GeoNVS, a novel view synthesis method, addresses geometric distortions and limited camera controllability in video diffusion models by introducing a Gaussian Splat Feature Adapter (GS-Adapter). GS-Adapter lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with the diffusion features in feature space. Experiments across 9 scenes and 18 settings show that GeoNVS achieves state-of-the-art performance, improving over SEVA and CameraCtrl by 11.3% and 14.9%, respectively, and reducing translation error by up to 2x and Chamfer Distance by up to 7x.
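The lift-and-render step can be illustrated with a minimal NumPy sketch. It is not the GS-Adapter implementation, whose Gaussian parameterization and rasterizer are not described here. Several assumptions apply: one Gaussian per input pixel, placed using a depth map (for example, from a feed-forward geometry model); a pinhole intrinsics matrix `K`; a relative pose `R`, `t` for the target view; and a simplified renderer that splats isotropic screen-space Gaussians with a fixed `sigma` and `opacity` and composites them front to back. The names `lift_to_gaussians` and `render_features` are hypothetical.

```python
import numpy as np

def lift_to_gaussians(features, depth, K):
    """Back-project per-pixel features (H, W, C) to 3D Gaussian centers.

    Hypothetical helper: one Gaussian per pixel, centered at depth * K^-1 [u, v, 1].
    Returns centers (N, 3) in the input camera frame and features (N, C).
    """
    H, W, C = features.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T            # ray directions with z = 1
    centers = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    return centers, features.reshape(-1, C)

def render_features(centers, feats, K, R, t, H, W, sigma=0.7, opacity=0.9, radius=2):
    """Splat Gaussian features into a target view with front-to-back alpha compositing.

    Simplification (assumption): isotropic 2D Gaussians of fixed screen-space size.
    Returns rendered features (H, W, C) and per-pixel coverage (accumulated opacity).
    """
    cam = centers @ R.T + t                    # input-camera frame -> target-camera frame
    keep = cam[:, 2] > 1e-6                    # drop points behind the target camera
    cam, feats = cam[keep], feats[keep]
    proj = cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3]            # pixel coordinates in the target view
    out = np.zeros((H, W, feats.shape[1]))
    trans = np.ones((H, W))                    # remaining transmittance per pixel
    for i in np.argsort(cam[:, 2]):            # nearest Gaussians first
        cu, cv = uv[i]
        us = np.arange(max(int(np.floor(cu)) - radius, 0), min(int(np.floor(cu)) + radius + 1, W))
        vs = np.arange(max(int(np.floor(cv)) - radius, 0), min(int(np.floor(cv)) + radius + 1, H))
        if us.size == 0 or vs.size == 0:
            continue                           # footprint falls outside the image
        gu, gv = np.meshgrid(us, vs)
        alpha = opacity * np.exp(-((gu - cu) ** 2 + (gv - cv) ** 2) / (2 * sigma ** 2))
        rows, cols = slice(vs[0], vs[-1] + 1), slice(us[0], us[-1] + 1)
        T = trans[rows, cols]
        out[rows, cols] += (T * alpha)[..., None] * feats[i]
        trans[rows, cols] = T * (1.0 - alpha)
    return out, 1.0 - trans

H, W, C = 8, 8, 4
K = np.array([[10.0, 0.0, W / 2], [0.0, 10.0, H / 2], [0.0, 0.0, 1.0]])
centers, g = lift_to_gaussians(np.ones((H, W, C)), np.full((H, W), 2.0), K)
out, cov = render_features(centers, g, K, np.eye(3), np.zeros(3), H, W)
```

With an identity pose and constant unit features, each rendered channel equals the coverage map. Rendering happens in feature space, so geometry constrains where features land without injecting input-level color.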
By adapting diffusion features in 3D Gaussian space, GeoNVS achieves state-of-the-art novel view synthesis with significantly improved geometric fidelity and camera control compared to existing video diffusion models.
Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design works zero-shot with diverse feed-forward geometry models, requiring no additional training, and can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance: GeoNVS improves over SEVA by 11.3% and over CameraCtrl by 14.9%, and reduces translation error by up to 2x and Chamfer Distance by up to 7x.
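The abstract says GS-Adapter "adaptively fuses" the rendered features with the diffusion features but does not give the fusion rule. The NumPy sketch below shows one common form of adaptive fusion: a per-pixel, per-channel sigmoid gate that blends the two feature maps convexly. It is a hypothetical stand-in, not the paper's rule. Three assumptions apply. First, the gate parameters `W_gate` and `b_gate` would be learned; here they are random. Second, `coverage` is a per-pixel confidence in [0, 1] for the rendered features, such as their accumulated opacity. Third, the gate is masked by that coverage so that regions with no geometric evidence keep the original diffusion features. The name `adaptive_fuse` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fuse(f_diff, f_gs, coverage, W_gate, b_gate):
    """Hypothetical gated fusion of diffusion features and rendered Gaussian features.

    f_diff, f_gs: (H, W, C) feature maps; coverage: (H, W) in [0, 1];
    W_gate: (2C, C) and b_gate: (C,) are gate parameters (learned in practice).
    """
    x = np.concatenate([f_diff, f_gs], axis=-1)   # (H, W, 2C) joint input to the gate
    gate = sigmoid(x @ W_gate + b_gate)           # (H, W, C) per-channel weight in (0, 1)
    gate = gate * coverage[..., None]             # no coverage -> fall back to diffusion features
    return gate * f_gs + (1.0 - gate) * f_diff    # convex blend per pixel and channel

rng = np.random.default_rng(0)
f_diff = rng.normal(size=(4, 4, 3))
f_gs = rng.normal(size=(4, 4, 3))
W_gate = 0.1 * rng.normal(size=(6, 3))
fused = adaptive_fuse(f_diff, f_gs, np.ones((4, 4)), W_gate, np.zeros(3))
```

The blend operates on features rather than RGB inputs, which matches the abstract's point that feature-space guidance avoids view-dependent color noise. Masking by coverage is a design choice of this sketch, not something the source confirms.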