This paper presents a cooperative control framework for dual-arm robots that integrates vision-language models (VLMs) with online reinforcement learning (RL) to enhance autonomy and adaptability in complex manipulation tasks. The proposed framework adopts a hierarchical architecture: at the top level, the VLM interprets natural language instructions and visual input to generate task plans; at the middle level, an online RL module refines manipulation policies and ensures adaptive decision-making under environmental uncertainty; and at the bottom level, compliant control based on trajectory planning and impedance regulation enables safe and robust execution. In the feedback loop, YOLOv5 detects objects, GraspNet obtains the optimal grasp pose, and CLIP (Contrastive Language-Image Pre-Training) judges whether the task has been completed. Simulations and real-world experiments validate the effectiveness of the proposed method. The dual-arm robot successfully performed various cooperative tasks such as grasping, bottle-cap unscrewing, water pouring, and box carrying, with online adaptive learning and training raising the task success rate from 43% to 100%. These results demonstrate that the proposed framework effectively bridges high-level reasoning with low-level control, providing a scalable solution for future applications in service robotics, industrial automation, and human-robot collaboration.

Note to Practitioners—This work is motivated by the practical challenge of enabling dual-arm robots to execute complex tasks in unstructured environments such as warehouses, factories, and service settings. Traditional robots often struggle with coordinating both arms, adapting to novel objects, and making real-time decisions. Our framework integrates a VLM for high-level task planning with RL for adaptive execution.
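The three-level loop described above can be sketched in a few lines. The sketch below is purely illustrative: `vlm_plan`, `rl_refine`, and `compliant_execute` are hypothetical stubs standing in for the VLM planner, the online RL module, and the impedance-controlled execution layer, and the retry-until-complete loop plays the role of the CLIP completion check; none of this is the authors' implementation.

```python
# Illustrative sketch of the hierarchical loop from the abstract.
# All three functions are stubs, not the paper's actual components.

def vlm_plan(instruction):
    """Top level: map a natural-language instruction to an ordered
    list of subtasks (stub for the VLM planner)."""
    plans = {
        "pour water": ["grasp bottle", "unscrew cap", "pour", "place bottle"],
    }
    return plans.get(instruction, [instruction])

def rl_refine(subtask, history):
    """Middle level: choose an arm assignment, refined online from
    past outcomes (stub for the RL policy)."""
    failures = history.count((subtask, False))
    # Switch to dual-arm execution after a single-arm failure.
    return "dual-arm" if failures > 0 else "single-arm"

def compliant_execute(subtask, mode):
    """Bottom level: trajectory planning + impedance control (stub).
    Here, dual-arm execution always succeeds; single-arm fails on 'pour'."""
    return not (mode == "single-arm" and subtask == "pour")

def run_task(instruction):
    """Run subtasks in order, retrying each until the completion
    check (the CLIP role) reports success."""
    history = []
    for subtask in vlm_plan(instruction):
        done = False
        while not done:
            mode = rl_refine(subtask, history)
            done = compliant_execute(subtask, mode)
            history.append((subtask, done))
    return history
```

Running `run_task("pour water")` shows the feedback behavior: the single-arm attempt at "pour" fails, the policy switches to dual-arm on the retry, and all other subtasks succeed on the first try.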
The VLM interprets human instructions and environmental cues, while RL enables the robot to refine its performance through trial and error. Practically, this allows the robot to decide when to use the left arm, the right arm, or both arms cooperatively, improving efficiency and flexibility across tasks, such as single-arm grasping of lightweight objects or dual-arm handling of heavier or elongated items. The framework enhances productivity while reducing manual programming effort. Current limitations include reliance on a fixed depth camera, whose single viewpoint can cause occlusions, and the computational cost of online model updates. These are partly mitigated through impedance control and torque feedback, but further improvements in perception and real-time learning are needed. Overall, the approach offers practitioners a pathway toward more versatile and adaptive dual-arm robotic systems for real-world deployment.
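To make the compliance idea concrete, a minimal Cartesian impedance law can be written as a spring-damper pulling the end-effector toward a desired pose. This is a generic textbook formulation, not the paper's controller; the gain values are hypothetical placeholders.

```python
import numpy as np

def impedance_force(x, v, x_d, v_d, K, D):
    """One step of a Cartesian impedance law: the commanded force
    behaves like a spring-damper toward (x_d, v_d), yielding compliant
    contact instead of rigid position tracking."""
    return K @ (x_d - x) + D @ (v_d - v)

# Illustrative gains and states (not the paper's values).
K = np.diag([300.0, 300.0, 200.0])   # stiffness (N/m)
D = np.diag([30.0, 30.0, 20.0])      # damping (N*s/m)
x = np.array([0.40, 0.00, 0.30])     # current end-effector position (m)
v = np.zeros(3)                      # current velocity (m/s)
x_d = np.array([0.40, 0.00, 0.25])   # desired position, 5 cm lower
v_d = np.zeros(3)

f = impedance_force(x, v, x_d, v_d, K, D)
# With these gains, f = [0, 0, -10] N: a gentle downward pull rather
# than a stiff position command.
```

Because the commanded force scales with the tracking error, an unexpected contact (e.g., the bottle cap binding during unscrewing) produces a bounded force instead of a large position-correction torque, which is what makes execution safe under uncertainty.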