Mar 12, 2026 · arXiv:2603.12193

SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics

Mengzhen Liu, Enshen Zhou, Cheng Chi, Yi Han, Shanyu Rong, Liming Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

AI Summary

The paper introduces SaPaVe, an end-to-end framework for active perception and manipulation in robotics that unifies semantic camera control with viewpoint-invariant execution. SaPaVe decouples camera and manipulation actions and employs a bottom-up training strategy, first training camera control on the new ActiveViewPose-200K dataset, then jointly optimizing both action types. Experiments on the new ActiveManip-Bench benchmark demonstrate that SaPaVe outperforms existing vision-language-action models, achieving up to 31.25% higher success rates in real-world tasks.

Key Contribution

By decoupling camera and manipulation actions and training them in a coordinated manner, SaPaVe achieves success rates up to 31.25% higher than existing end-to-end vision-language-action models in real-world robotic manipulation tasks.

Abstract

Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and \(\pi_0\), achieving up to 31.25% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe
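The decoupled action space and bottom-up schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the scalar feature, the linear heads, the action sizes (2 camera values, 3 arm values), and all names such as `DecoupledPolicy`, `train_camera_stage`, and `train_joint_stage` are hypothetical, and the 3D geometry-aware module is not modeled. It only shows the idea of two separate output heads, with the camera head trained first and both heads then trained together on hybrid data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DecoupledPolicy:
    """Toy policy with separate camera and manipulation heads (hypothetical)."""
    # Assumed sizes: 2 camera values (e.g. pan, tilt) and 3 arm values.
    camera_w: List[float] = field(default_factory=lambda: [0.0, 0.0])
    manip_w: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])

    def predict(self, feature: float) -> Tuple[List[float], List[float]]:
        # Each head produces its own action vector; they are never concatenated
        # into a single shared action space.
        camera = [w * feature for w in self.camera_w]
        manip = [w * feature for w in self.manip_w]
        return camera, manip


def sgd_step(weights, feature, target, lr):
    # Gradient of 0.5 * (w * feature - t) ** 2 w.r.t. w is (w * feature - t) * feature.
    return [w - lr * (w * feature - t) * feature for w, t in zip(weights, target)]


def train_camera_stage(policy, camera_data, lr=0.1, epochs=50):
    """Stage 1: update only the camera head on camera-only data."""
    for _ in range(epochs):
        for feature, cam_target in camera_data:
            policy.camera_w = sgd_step(policy.camera_w, feature, cam_target, lr)


def train_joint_stage(policy, hybrid_data, lr=0.1, epochs=50):
    """Stage 2: update both heads on data that carries both action labels."""
    for _ in range(epochs):
        for feature, cam_target, manip_target in hybrid_data:
            policy.camera_w = sgd_step(policy.camera_w, feature, cam_target, lr)
            policy.manip_w = sgd_step(policy.manip_w, feature, manip_target, lr)


# Made-up data purely for illustration.
policy = DecoupledPolicy()
train_camera_stage(policy, [(1.0, [0.5, -0.5])])
train_joint_stage(policy, [(1.0, [0.5, -0.5], [0.1, 0.2, 0.3])])
camera_action, manip_action = policy.predict(1.0)
```

In this sketch the manipulation head stays untouched during the first stage, which mirrors the "camera control first, then joint optimization" ordering; a shared action space would instead emit one vector and update it from the start.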

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 63

Year: 2026

Venue: N/A
