BUPT · PolyU · Apr 15, 2026 · arXiv:2604.13942

Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

Xinyu Ning, Zhe Hu, Zhengyu Hu, Xinxin Xie, Weize Li, Zhipeng Tang, Chongyu Wang, Zejun Yang, Hanlin Wang, Yitong Liu, Zhongzhu Pu

AI Summary

Goal2Skill introduces a dual-system framework for long-horizon embodied manipulation that separates high-level semantic reasoning (planning) from low-level motor execution (control). A VLM-based planner manages task memory, decomposes goals, verifies outcomes, and corrects errors, while a VLA-based executor uses diffusion to generate actions based on filtered observations. Experiments on RMBench show Goal2Skill significantly outperforms baselines, achieving a 32.4% success rate versus 9.8% for the best baseline, demonstrating the importance of structured memory and closed-loop recovery.

Key Contribution

Achieves roughly 3.3x the average success rate of the strongest baseline (32.4% vs. 9.8%) in long-horizon robotic manipulation by explicitly separating high-level planning from low-level control, enabling memory-aware reasoning and adaptive replanning.

Abstract

Recent vision-language-action (VLA) systems have demonstrated strong capabilities in embodied manipulation. However, most existing VLA policies rely on limited observation windows and end-to-end action prediction, which makes them brittle in long-horizon, memory-dependent tasks with partial observability, occlusions, and multi-stage dependencies. Such tasks require not only precise visuomotor control, but also persistent memory, adaptive task decomposition, and explicit recovery from execution failures. To address these limitations, we propose a dual-system framework for long-horizon embodied manipulation. Our framework explicitly separates high-level semantic reasoning from low-level motor execution. A high-level planner, implemented as a VLM-based agentic module, maintains structured task memory and performs goal decomposition, outcome verification, and error-driven correction. A low-level executor, instantiated as a VLA-based visuomotor controller, carries out each sub-task through diffusion-based action generation conditioned on geometry-preserving filtered observations. Together, the two systems form a closed loop between planning and execution, enabling memory-aware reasoning, adaptive replanning, and robust online recovery. Experiments on representative RMBench tasks show that the proposed framework substantially outperforms representative baselines, achieving a 32.4% average success rate compared with 9.8% for the strongest baseline. Ablation studies further confirm the importance of structured memory and closed-loop recovery for long-horizon manipulation.
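The closed loop between planning and execution described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `TaskMemory` layout, the `run_closed_loop` function, and the toy planner, executor, and verifier are all hypothetical stand-ins. In the paper, a VLM-based planner and a diffusion-based VLA executor would fill these roles; here plain Python functions take their place so the control flow is visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TaskMemory:
    """Hypothetical structured task memory kept by the high-level planner."""
    goal: str
    completed: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)


def run_closed_loop(
    goal: str,
    next_subtask: Callable[[TaskMemory], Optional[str]],
    execute: Callable[[str], bool],
    verify: Callable[[str, bool], bool],
    max_steps: int = 10,
) -> TaskMemory:
    """Plan, execute, verify, and replan until the planner reports completion."""
    memory = TaskMemory(goal=goal)
    for _ in range(max_steps):
        # The planner reads memory each step, so it can react to past failures.
        subtask = next_subtask(memory)
        if subtask is None:
            break  # planner considers the goal reached
        outcome = execute(subtask)
        if verify(subtask, outcome):
            memory.completed.append(subtask)
        else:
            # Record the failure; the next planner call sees it and can recover.
            memory.failures.append(subtask)
    return memory


# Toy stand-ins (assumptions for illustration only).
PLAN = ["open drawer", "pick block", "place block in drawer"]
attempts = {}


def toy_planner(memory: TaskMemory) -> Optional[str]:
    # Return the first planned step that has not been verified as complete.
    for step in PLAN:
        if step not in memory.completed:
            return step
    return None


def toy_executor(subtask: str) -> bool:
    # Simulate a low-level controller that fails its first "pick block" attempt.
    attempts[subtask] = attempts.get(subtask, 0) + 1
    return not (subtask == "pick block" and attempts[subtask] == 1)


def toy_verifier(subtask: str, outcome: bool) -> bool:
    # A real verifier would inspect observations; here it trusts the executor.
    return outcome


result = run_closed_loop(
    "put the block in the drawer", toy_planner, toy_executor, toy_verifier
)
print(result.completed)
print(result.failures)
```

In this toy run the failed grasp is logged in memory and the planner reissues the same sub-task on the next step. A richer planner could instead emit a corrective sub-task based on the recorded failure.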

Robotics & Embodied AI · Tool Use & Agents · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
