VistaBot enhances view robustness in end-to-end robotic manipulation by combining feed-forward geometric models for 4D geometry estimation with video diffusion models for view synthesis. It extracts latent representations from synthesized views and uses them for action learning, eliminating the need for camera calibration at test time. Experiments demonstrate improvements of 2.79$\times$ and 2.63$\times$ in the newly introduced View Generalization Score (VGS) over ACT and $\pi_0$ baselines, respectively, in both simulated and real-world environments.
Achieve robust robot manipulation across diverse viewpoints without camera calibration by synthesizing novel views with a geometry-aware video diffusion model.
Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when trained with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based ($\pi_0$) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79$\times$ and 2.63$\times$ over ACT and $\pi_0$, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.
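To make the three-stage design concrete, here is a minimal sketch of how the pipeline described in the abstract could be wired together. All module names, dimensions, and interfaces below are illustrative assumptions, not the authors' released code: a real system would use an actual feed-forward 4D geometry network and an iteratively denoising video diffusion model rather than these placeholder layers.

```python
# Hypothetical sketch of the VistaBot pipeline described in the abstract.
# Every module here is a stand-in; names and shapes are assumptions.
import torch
import torch.nn as nn


class GeometryEstimator(nn.Module):
    """Placeholder for the feed-forward 4D geometry model (assumed interface)."""

    def __init__(self, img_feat_dim: int = 512, geom_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_feat_dim, geom_dim),
            nn.ReLU(),
            nn.Linear(geom_dim, geom_dim),
        )

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # Maps per-frame observation features to a geometry embedding.
        return self.net(img_feats)


class ViewSynthesizer(nn.Module):
    """Placeholder for the geometry-conditioned video diffusion model.

    A real implementation would run iterative denoising; this stub only
    exposes the latent such a model would produce for a target viewpoint.
    """

    def __init__(self, geom_dim: int = 256, latent_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(geom_dim + 6, latent_dim)  # 6-DoF target pose

    def forward(self, geom: torch.Tensor, target_pose: torch.Tensor) -> torch.Tensor:
        # Conditions on geometry plus a target camera pose and returns the
        # synthesized-view latent rather than decoded pixels.
        return self.proj(torch.cat([geom, target_pose], dim=-1))


class LatentActionHead(nn.Module):
    """Placeholder policy head consuming synthesized-view latents."""

    def __init__(self, latent_dim: int = 128, action_dim: int = 7, chunk: int = 8):
        super().__init__()
        self.head = nn.Linear(latent_dim, action_dim * chunk)
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        # Predicts a chunk of future actions (ACT-style), e.g. 8 steps of 7-DoF.
        return self.head(latent).view(-1, self.chunk, self.action_dim)


if __name__ == "__main__":
    geom_model, synth, policy = GeometryEstimator(), ViewSynthesizer(), LatentActionHead()
    img_feats = torch.randn(1, 512)   # features of the current observation
    target_pose = torch.randn(1, 6)   # arbitrary test-time viewpoint, no calibration
    latent = synth(geom_model(img_feats), target_pose)
    actions = policy(latent)
    print(actions.shape)  # torch.Size([1, 8, 7])
```

The key design point the abstract emphasizes survives even in this toy form: the policy never sees calibrated camera parameters, only latents tied to a requested viewpoint, which is what lets it run on unseen cameras at test time.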
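The abstract does not give a formula for VGS. One plausible, purely illustrative reading is an average of task success rates over a set of held-out camera viewpoints, under which the reported 2.79$\times$ and 2.63$\times$ gains would be ratios of these aggregates; the paper's actual definition may differ.

```python
# Purely illustrative: the abstract does not specify the VGS formula.
# Here VGS is assumed to be the mean per-viewpoint success rate over
# a set of held-out camera viewpoints.
from statistics import mean


def view_generalization_score(success_by_view: dict[str, list[bool]]) -> float:
    """Average the per-viewpoint success rates across all evaluated viewpoints."""
    per_view = [mean(trials) for trials in success_by_view.values()]
    return mean(per_view)


# Example: three hypothetical evaluation viewpoints with five trials each.
trials = {
    "front": [True, True, False, True, True],
    "left_45": [True, False, False, True, False],
    "top_down": [False, True, True, False, True],
}
print(f"VGS = {view_generalization_score(trials):.2f}")  # VGS = 0.60
```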