FudanHITNational Key Laboratory of Smart FarmPeng Cheng LaboratoryApr 13, 2026arXiv:2604.11320

CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping

Yiran Ling, Wenxuan Li, Siying Dong, Yize Zhang, Xiaoyao Huang, Jing Jiang, Ruonan Li, Jie Liu

AI Summary

This paper introduces CLASP, a closed-loop framework for desktop object grasping that tackles spatial hallucination and open-loop fragility in vision-language models (VLMs). CLASP uses a Dual-Pathway Hierarchical Perception module to decouple semantic intent from geometric grounding, and an Asynchronous Closed-Loop Evaluator for error correction via text-based feedback. Experiments show CLASP achieves 87% success rate, demonstrating strong generalization and robustness in sim-to-real transfer and cluttered environments.

Key Contribution

Achieve near-perfect robot grasping (87% success) by giving robots a "second look" – a closed-loop system that compares pre- and post-action states and uses text feedback to correct errors.

Abstract

Robot grasping of desktop object is widely used in intelligent manufacturing, logistics, and agriculture.Although vision-language models (VLMs) show strong potential for robotic manipulation, their deployment in low-level grasping faces key challenges: scarce high-quality multimodal demonstrations, spatial hallucination caused by weak geometric grounding, and the fragility of open-loop execution in dynamic environments. To address these challenges, we propose Closed-Loop Asynchronous Spatial Perception(CLASP), a novel asynchronous closed-loop framework that integrates multimodal perception, logical reasoning, and state-reflective feedback. First, we design a Dual-Pathway Hierarchical Perception module that decouples high-level semantic intent from geometric grounding. The design guides the output of the inference model and the definite action tuples, reducing spatial illusions. Second, an Asynchronous Closed-Loop Evaluator is implemented to compare pre- and post-execution states, providing text-based diagnostic feedback to establish a robust error-correction loop and improving the vulnerability of traditional open-loop execution in dynamic environments. Finally, we design a scalable multi-modal data engine that automatically synthesizes high-quality spatial annotations and reasoning templates from real and synthetic scenes without human teleoperation. Extensive experiments demonstrate that our approach significantly outperforms existing baselines, achieving an 87.0% overall success rate. Notably, the proposed framework exhibits remarkable generalization across diverse objects, bridging the sim-to-real gap and providing exceptional robustness in geometrically challenging categories and cluttered scenarios.

Computer Vision Multimodal Models Robotics & Embodied AI

Citation Metrics

Citations0

Influential citations0

References32

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping

Related Papers