VistaBot enhances view robustness in end-to-end robotic manipulation by combining feed-forward geometric models for 4D geometry estimation with video diffusion models for view synthesis. It extracts latent representations from synthesized views and uses them for action learning, eliminating the need for camera calibration at test time. Experiments demonstrate improvements of 2.79$\times$ and 2.63$\times$ in the newly introduced View Generalization Score (VGS) over ACT and $\pi_0$ baselines, respectively, in both simulated and real-world environments.
Achieve robust robot manipulation across diverse viewpoints without camera calibration by synthesizing novel views with a geometry-aware video diffusion model.
Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when trained with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based ($\pi_0$) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79$\times$ and 2.63$\times$ over ACT and $\pi_0$, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.
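To make the three-stage design concrete, here is a minimal sketch of how the pipeline described in the abstract could be wired together. All module names, dimensions, and interfaces below are illustrative assumptions, not the authors' released code: a real system would use an actual feed-forward 4D geometry network and an iteratively denoising video diffusion model rather than these placeholder layers.

```python
# Hypothetical sketch of the VistaBot pipeline described in the abstract.
# Every module here is a stand-in; names and shapes are assumptions.
import torch
import torch.nn as nn


class GeometryEstimator(nn.Module):
    """Placeholder for the feed-forward 4D geometry model (assumed interface)."""

    def __init__(self, img_feat_dim: int = 512, geom_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_feat_dim, geom_dim),
            nn.ReLU(),
            nn.Linear(geom_dim, geom_dim),
        )

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # Maps per-frame observation features to a geometry embedding.
        return self.net(img_feats)


class ViewSynthesizer(nn.Module):
    """Placeholder for the geometry-conditioned video diffusion model.

    A real implementation would run iterative denoising; this stub only
    exposes the latent such a model would produce for a target viewpoint.
    """

    def __init__(self, geom_dim: int = 256, latent_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(geom_dim + 6, latent_dim)  # 6-DoF target pose

    def forward(self, geom: torch.Tensor, target_pose: torch.Tensor) -> torch.Tensor:
        # Conditions on geometry plus a target camera pose and returns the
        # synthesized-view latent rather than decoded pixels.
        return self.proj(torch.cat([geom, target_pose], dim=-1))


class LatentActionHead(nn.Module):
    """Placeholder policy head consuming synthesized-view latents."""

    def __init__(self, latent_dim: int = 128, action_dim: int = 7, chunk: int = 8):
        super().__init__()
        self.head = nn.Linear(latent_dim, action_dim * chunk)
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        # Predicts a chunk of future actions (ACT-style), e.g. 8 steps of 7-DoF.
        return self.head(latent).view(-1, self.chunk, self.action_dim)


if __name__ == "__main__":
    geom_model, synth, policy = GeometryEstimator(), ViewSynthesizer(), LatentActionHead()
    img_feats = torch.randn(1, 512)   # features of the current observation
    target_pose = torch.randn(1, 6)   # arbitrary test-time viewpoint, no calibration
    latent = synth(geom_model(img_feats), target_pose)
    actions = policy(latent)
    print(actions.shape)  # torch.Size([1, 8, 7])
```

The key design point the abstract emphasizes survives even in this toy form: the policy never sees calibrated camera parameters, only latents tied to a requested viewpoint, which is what lets it run on unseen cameras at test time.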
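The abstract does not give a formula for VGS. One plausible, purely illustrative reading is an average of task success rates over a set of held-out camera viewpoints, under which the reported 2.79$\times$ and 2.63$\times$ gains would be ratios of these aggregates; the paper's actual definition may differ.

```python
# Purely illustrative: the abstract does not specify the VGS formula.
# Here VGS is assumed to be the mean per-viewpoint success rate over
# a set of held-out camera viewpoints.
from statistics import mean


def view_generalization_score(success_by_view: dict[str, list[bool]]) -> float:
    """Average the per-viewpoint success rates across all evaluated viewpoints."""
    per_view = [mean(trials) for trials in success_by_view.values()]
    return mean(per_view)


# Example: three hypothetical evaluation viewpoints with five trials each.
trials = {
    "front": [True, True, False, True, True],
    "left_45": [True, False, False, True, False],
    "top_down": [False, True, True, False, True],
}
print(f"VGS = {view_generalization_score(trials):.2f}")  # VGS = 0.60
```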