Mar 9, 2026

arXiv:2603.07866

Viewpoint-Agnostic Grasp Pipeline using VLM and Partial Observations

Dilermando Almeida, Juliano Negri, Guilherme Lazzarini, Thiago H. Segreto, Ranulfo Bezerra, Ricardo V. Godoy, Marcelo Becker

AI Summary

This paper introduces an end-to-end pipeline for language-guided robotic grasping under partial observations and clutter. The pipeline grounds language commands in RGB images using open-vocabulary detection and promptable instance segmentation, improves geometric reliability through back-projected depth compensation and two-stage point cloud completion, and generates collision-filtered 6-DoF grasp candidates. Evaluated on a quadruped robot with an arm, the approach achieved a 90% grasp success rate (9/10 trials), compared with 30% (3/10) for a view-dependent baseline.
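Grounding a target in RGB and then reasoning about its geometry implies lifting the segmented depth pixels into 3D. The following is a minimal sketch of that step using the standard pinhole camera model; it is not the authors' implementation. The function name `backproject_masked_depth`, the `max_depth` cutoff, and the list-of-rows input layout are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_masked_depth(
    depth: Sequence[Sequence[float]],   # depth image in metres, indexed [row][col]
    mask: Sequence[Sequence[bool]],     # instance mask with the same shape as depth
    fx: float, fy: float,               # focal lengths in pixels
    cx: float, cy: float,               # principal point in pixels
    max_depth: float = 5.0,             # hypothetical cutoff for implausible readings
) -> List[Point]:
    """Lift masked depth pixels to 3D camera-frame points (pinhole model)."""
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Skip pixels outside the mask and invalid or out-of-range depth.
            if not keep or not math.isfinite(z) or z <= 0.0 or z > max_depth:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


if __name__ == "__main__":
    depth = [[1.0, 0.0], [2.0, 2.0]]
    mask = [[True, True], [False, True]]
    print(backproject_masked_depth(depth, mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0))
```

Pixels with missing depth are simply dropped here, which is one reason an occluded or reflective object yields a sparse cloud that later stages would need to compensate for.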

Key Contribution

A pipeline that combines language grounding, point cloud completion, and safety-oriented grasp selection lets a quadruped robot with an arm grasp objects in cluttered scenes from partial observations.

Abstract

Robust grasping in cluttered, unstructured environments remains challenging for mobile legged manipulators due to occlusions that lead to partial observations, unreliable depth estimates, and the need for collision-free, execution-feasible approaches. In this paper we present an end-to-end pipeline for language-guided grasping that bridges open-vocabulary target selection to safe grasp execution on a real robot. Given a natural-language command, the system grounds the target in RGB using open-vocabulary detection and promptable instance segmentation, extracts an object-centric point cloud from RGB-D, and improves geometric reliability under occlusion via back-projected depth compensation and two-stage point cloud completion. We then generate and collision-filter 6-DoF grasp candidates and select an executable grasp using safety-oriented heuristics that account for reachability, approach feasibility, and clearance. We evaluate the method on a quadruped robot with an arm in two cluttered tabletop scenarios, using paired trials against a view-dependent baseline. The proposed approach achieves a 90% overall success rate (9/10) against 30% (3/10) for the baseline, demonstrating substantially improved robustness to occlusions and partial observations in clutter.
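The selection step described in the abstract (collision filtering followed by heuristics on reachability, approach feasibility, and clearance) can be pictured with the sketch below. It illustrates the general filter-then-rank idea only. The `GraspCandidate` fields, the threshold values, and the tie-breaking rule are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GraspCandidate:
    name: str
    score: float              # quality score from a grasp generator (assumed)
    distance_m: float         # arm base to grasp position, in metres
    approach_tilt_deg: float  # deviation of approach direction from a preferred axis
    clearance_m: float        # minimum gripper distance to non-target geometry
    in_collision: bool        # result of an upstream collision check


def select_grasp(
    candidates: List[GraspCandidate],
    max_reach_m: float = 0.8,        # hypothetical reach limit
    max_tilt_deg: float = 60.0,      # hypothetical approach-feasibility limit
    min_clearance_m: float = 0.02,   # hypothetical clearance margin
) -> Optional[GraspCandidate]:
    """Drop unsafe candidates, then return the best remaining one (or None)."""
    feasible = [
        c for c in candidates
        if not c.in_collision
        and c.distance_m <= max_reach_m
        and c.approach_tilt_deg <= max_tilt_deg
        and c.clearance_m >= min_clearance_m
    ]
    if not feasible:
        return None
    # Rank by generator score; break ties in favour of larger clearance.
    return max(feasible, key=lambda c: (c.score, c.clearance_m))


if __name__ == "__main__":
    candidates = [
        GraspCandidate("a", 0.95, 0.60, 20.0, 0.05, in_collision=True),
        GraspCandidate("b", 0.80, 0.70, 30.0, 0.04, in_collision=False),
        GraspCandidate("c", 0.90, 0.95, 10.0, 0.06, in_collision=False),
    ]
    best = select_grasp(candidates)
    print(best.name if best else "no feasible grasp")
```

Treating the safety checks as hard filters rather than weighted score terms means a high-scoring but unsafe grasp can never be chosen, at the cost of sometimes returning no grasp at all.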

Computer Vision, Multimodal Models, Robotics & Embodied AI


Year: 2026

Venue: N/A
