This paper introduces RL3DEdit, a reinforcement learning framework for multi-view consistent 3D scene editing that leverages 2D diffusion model priors. It addresses the challenge of maintaining consistency across multiple views during 3D editing by using rewards derived from the 3D foundation model VGGT, specifically confidence maps and pose estimation errors, to guide the RL agent. Experiments show that RL3DEdit achieves superior editing quality and multi-view consistency compared to existing methods, while also being more efficient.
Instead of struggling to generate multi-view consistent 3D edits directly, this paper exploits the fact that 3D consistency is far easier to *verify* than to produce: verification signals serve as rewards for reinforcement learning, enabling high-quality 3D scene editing from 2D diffusion priors.
Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data: we feed it the edited images and use the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
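To make the reward idea concrete, here is a minimal sketch of how confidence maps and pose errors might be folded into one scalar reward. It is an illustration under stated assumptions, not the paper's formulation: the function names, the weights `w_conf` and `w_pose`, the additive combination, and the choice of reference poses are all hypothetical, and plain NumPy arrays stand in for the outputs of VGGT, which is not called here.

```python
import numpy as np


def rotation_angle(R_a, R_b):
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    cos = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def consistency_reward(conf_maps, pred_poses, ref_poses, w_conf=1.0, w_pose=1.0):
    """Hypothetical scalar reward for a set of edited views.

    conf_maps:  (V, H, W) per-pixel confidence for each of V views
                (stand-in for a 3D model's confidence output).
    pred_poses: (V, 4, 4) camera poses estimated from the edited views.
    ref_poses:  (V, 4, 4) reference poses (assumed here to be the poses
                of the original, unedited views).
    """
    # Higher mean confidence is taken to mean more 3D-consistent views.
    conf_term = float(np.mean(conf_maps))

    # Pose error per view: rotation angle plus translation distance.
    errors = []
    for P, Q in zip(pred_poses, ref_poses):
        rot_err = rotation_angle(P[:3, :3], Q[:3, :3])
        trans_err = float(np.linalg.norm(P[:3, 3] - Q[:3, 3]))
        errors.append(rot_err + trans_err)
    pose_term = float(np.mean(errors))

    # Reward confidence, penalize pose drift (weights are assumptions).
    return w_conf * conf_term - w_pose * pose_term
```

In this sketch, views whose estimated poses drift from the reference poses lower the reward, while uniformly confident predictions raise it. How the two terms are actually scaled and combined is not stated in the abstract.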