The paper introduces R2RDreamer, a real-to-real demonstration augmentation framework for improving spatial generalization in imitation-learned manipulation policies. It performs lightweight 3D augmentation on incomplete object pointclouds and end-effector trajectories, projects the edited scene into masked control videos with occlusion-aware reasoning, and completes temporally coherent RGB observations in 2D video space, avoiding the environment setup and sim-to-real gap of simulation-based augmentation. Experiments on spatially shifted manipulation tasks with 2D diffusion-style and vision-language-action policies show improved spatial generalization from limited source demonstrations, with analyses supporting the contributions of 3D editing, occlusion-aware projection, and video completion.
R2RDreamer improves spatial generalization for RGB-based 2D manipulation policies by turning 3D-edited action-observation pairs from real demonstrations into temporally coherent 2D video data, avoiding the sim-to-real gap of simulation-based augmentation.
Spatial generalization is critical for imitation-learned manipulation policies, but achieving it typically requires scaling demonstrations across diverse object poses, robot configurations, and camera viewpoints. Data augmentation from a few source demonstrations offers a practical alternative to costly real-world collection. Simulation-based augmentation can create controllable variation, but requires complex environment and object setup and may introduce a sim-to-real gap. Recent real-to-real methods avoid these issues by jointly editing 3D observations and action trajectories from real demonstrations, yet they still rely on strong 3D scene parsing and geometry completion, and often produce observations tailored to 3D pointcloud policies rather than RGB-based 2D policies. We propose R2RDreamer, a real-to-real demonstration augmentation framework that preserves the geometric consistency of 3D action-observation editing while moving visual completion to 2D video space. Specifically, R2RDreamer first performs lightweight 3D augmentation by editing incomplete object pointclouds and end-effector trajectories in a shared 3D frame; it then projects the edited scene into masked image-space control videos with occlusion-aware reasoning and uses a dense-control image-to-video model to complete temporally coherent RGB observations. Experiments on spatially shifted manipulation tasks with both 2D diffusion-style policies and vision-language-action policies show that R2RDreamer improves spatial generalization from limited source demonstrations, with analyses validating the contributions of 3D editing, occlusion-aware projection, and video completion.
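The two geometric steps described above (joint 3D editing of object points and end-effector waypoints, followed by occlusion-aware projection into image space) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single rigid edit shared by the object and the waypoints, takes the shared 3D frame to be the camera frame, uses a pinhole camera, and resolves occlusion with a simple per-pixel z-buffer. All function names, the intrinsics matrix K, and the sample values are hypothetical.

```python
import numpy as np

def rigid_transform(yaw_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z, then translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def apply_transform(T, points):
    """Apply a 4x4 rigid transform to an (N, 3) array of points."""
    return points @ T[:3, :3].T + T[:3, 3]

def project_with_zbuffer(points_cam, K, height, width):
    """Project camera-frame points to pixels, keeping the nearest depth per pixel.

    Returns a depth map (inf where nothing projects) and a boolean mask of
    covered pixels. Points behind the camera or outside the image are dropped.
    """
    depth = np.full((height, width), np.inf)
    pts = points_cam[points_cam[:, 2] > 0]           # keep points in front of the camera
    uvw = pts @ K.T                                   # pinhole projection (homogeneous)
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # Unbuffered minimum: when several points hit one pixel, the closest wins.
    np.minimum.at(depth, (v[inside], u[inside]), pts[inside, 2])
    return depth, np.isfinite(depth)

# Hypothetical inputs: a partial object pointcloud and end-effector waypoints.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
object_points = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0]])
waypoints = np.array([[0.0, 0.1, 1.0], [0.0, 0.05, 1.0]])

# Assumption: one edit moves the object and its trajectory together,
# so the end-effector stays consistent relative to the object.
T_edit = rigid_transform(0.0, [0.1, 0.0, 0.0])
edited_object = apply_transform(T_edit, object_points)
edited_waypoints = apply_transform(T_edit, waypoints)

depth, mask = project_with_zbuffer(edited_object, K, height=48, width=64)
```

In this sketch the mask marks pixels covered by the edited object. A masked control video in the sense of the abstract would carry such per-frame masks, with the uncovered or newly exposed regions left for the image-to-video model to complete.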