This paper introduces VL-RL, a hierarchical framework combining a vision-language (VL) planner for high-level task planning with a reinforcement learning (RL)-based motion planner for low-level control. The VL planner handles visual perception and semantic understanding, while the RL planner provides flexibility and adaptability to environmental changes during task execution. Experiments demonstrate that VL-RL achieves more efficient and stable dual-robot collaborative manipulation, particularly in dynamic grasping and long-horizon complex tasks, compared to end-to-end vision-language-action models.
Robots get a boost in adaptability and speed thanks to a new hierarchical framework that lets them react to changing environments on the fly, without needing to re-plan the whole task.
Vision-language-action models (VLAs) use an end-to-end learning architecture that integrates visual perception, semantic understanding, and motion control. However, in dynamic or long-horizon tasks, VLAs exhibit poor robustness and limited real-time adjustment when target objects, instructions, or environments change. To address these limitations, we propose VL-RL, a hierarchical framework consisting of a vision-language (VL) planner with strong VL understanding and high-level task-planning abilities, and a reinforcement learning (RL)-based low-level motion planner with enhanced flexibility and broader applicability. If the environmental state changes during task execution, the RL planner in VL-RL directly makes dynamic adjustments at the subtask level based on visual feedback to achieve the task goals, without the need for time-consuming information processing by the VL planner. Experiments demonstrate that VL-RL completes dual-robot collaborative manipulation tasks more efficiently and stably. Finally, our work is validated on dynamic grasping tasks and long-horizon complex tasks.
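The division of labor the abstract describes — a slow VL planner that produces a subtask sequence once, and a fast RL planner that retries each subtask on visual feedback — can be sketched as a simple control loop. This is a minimal illustration under assumed interfaces; all class and method names (`VLPlanner`, `RLMotionPlanner`, `run_task`) are hypothetical, not the authors' API.

```python
# Hypothetical sketch of the VL-RL hierarchical control loop.
# All names here are illustrative assumptions, not the paper's implementation.

class VLPlanner:
    """High-level planner: maps an instruction and observation to subtasks."""
    def plan(self, instruction, observation):
        # In the real system this would be a vision-language model;
        # a fixed subtask list stands in for its output here.
        return ["reach", "grasp", "handover", "place"]

class RLMotionPlanner:
    """Low-level planner: executes one subtask, adapting to visual feedback."""
    def execute(self, subtask, get_observation):
        # Key point from the abstract: environmental changes are handled
        # at the subtask level, without re-invoking the slow VL planner.
        for _ in range(3):  # bounded re-attempts under perturbation
            obs = get_observation()
            if obs.get("perturbed"):
                continue  # retry the same subtask with fresh feedback
            return True
        return False

def run_task(instruction, get_observation):
    vl, rl = VLPlanner(), RLMotionPlanner()
    subtasks = vl.plan(instruction, get_observation())  # planned once
    return all(rl.execute(s, get_observation) for s in subtasks)

# Toy usage: a static environment where every subtask succeeds.
done = run_task("hand the cup to the other robot", lambda: {"perturbed": False})
print(done)  # True
```

The design point the sketch captures is that `VLPlanner.plan` runs once per task, while `RLMotionPlanner.execute` is the only component in the reactive loop, which is what makes subtask-level adaptation fast.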