Imperial | May 27, 2026 | arXiv:2605.28486

Mag-VLA: Vision-Language-Action Model for Bimanual Magnetically Actuated Microrobot Manipulation

Yongchen Wang, Kangyi Lu, Lan Wei, Dandan Zhang

AI Summary

The paper introduces Mag-VLA, a vision-language-action model that controls bimanual magnetically actuated microrobots using a Qwen2.5-VL-7B backbone fine-tuned with LoRA. To address bimanual coordination and temporal coherence, the authors add a motion-aware phase classifier and a phase-conditioned Action Chunking Transformer (ACT) decoder. On a newly constructed teleoperated dataset covering three task configurations, the ACT-based decoder outperforms alternative generative action heads. In real-robot experiments, Mag-VLA reaches a 90% approach success rate across all tasks and transport success rates of 80%, 70%, and 50% as task difficulty increases.

Key Contribution

Bimanual magnetic microrobot manipulation, previously limited by coupled control challenges, becomes feasible with a hierarchical vision-language-action model that combines a motion-aware phase classifier with a phase-conditioned ACT decoder.

Abstract

Magnetically actuated microrobots have been used as wireless, non-contact manipulation tools at microscales, making them promising for minimally invasive applications. However, their control remains challenging due to indirect actuation, limited sensing, and nonlinear magnetic interactions. In this work, we propose Mag-VLA, a vision-language-action (VLA) model for dexterous magnetic microrobot manipulation using two robotic arms with mounted magnets for dynamic magnetic-field construction. Bimanual coordination enables capabilities such as microrobot reorientation that are difficult or infeasible with a single arm, but it also introduces coupled control challenges, as the policy must generate coordinated trajectories for both actuators within a shared workspace. Our framework adapts a Qwen2.5-VL-7B backbone using Low-Rank Adaptation (LoRA) to process visual observations and language instructions for action prediction. To capture task progression, we introduce a motion-aware phase classifier and a phase-conditioned Action Chunking Transformer (ACT) decoder for temporally coherent multi-step control. We further construct a teleoperated magnetic microrobot manipulation dataset covering three task configurations. Ablation studies show that the ACT-based decoder substantially outperforms alternative generative action heads. In real-robot experiments, Mag-VLA achieves a 90% approach success rate across all tasks and transport success rates of 80%, 70%, and 50% as task difficulty increases. These results demonstrate that hierarchical VLA modeling provides a promising framework for magnetic microrobot manipulation.
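The abstract names Low-Rank Adaptation (LoRA) as the way the Qwen2.5-VL-7B backbone is adapted. As a rough illustration of the general technique only, the sketch below applies a LoRA update to a single frozen linear layer in plain Python. The class name `LoRALinear`, the rank, the scaling factor `alpha`, and the initialisation are illustrative assumptions; the abstract does not state which layers are adapted or which hyperparameters Mag-VLA uses.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical helper for illustration; not taken from the Mag-VLA code.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight, shape (d_out, d_in)
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts as small random values, B starts at zero, so the adapted
        # layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)               # W x
        delta = matvec(self.B, matvec(self.A, x))   # B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Number of parameters in A and B (the only ones that would be trained)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

Only `A` and `B` would receive gradients during fine-tuning, so the trainable parameter count grows with `rank * (d_in + d_out)` rather than `d_in * d_out`. That is the general reason LoRA is a common choice for adapting a 7B-parameter backbone on a small teleoperated dataset.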

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
