The paper introduces DySL-VLA, a framework that reduces the computational cost of Vision-Language-Action (VLA) models for robot manipulation by dynamically skipping layers according to each action's importance. A prior-post skipping guidance mechanism decides when to skip "incremental" layers, while "informative" layers are always executed. The model is trained with a skip-aware two-stage knowledge distillation algorithm, cutting trainable parameters by 85.7x and delivering a 3.75x speedup while maintaining or improving task success.
Get 3.75x faster VLA inference for robot manipulation without sacrificing accuracy by dynamically skipping layers based on action importance.
Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks such as manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task vary in importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that reduces computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are always executed, and incremental layers, which can be selectively skipped. To skip layers intelligently without sacrificing accuracy, we introduce a prior-post skipping guidance mechanism that determines when to initiate layer skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments show that DySL-VLA achieves a 2.1% improvement in success length over Deer-VLA on the Calvin dataset, while reducing trainable parameters by a factor of 85.7 and providing a 3.75x speedup over the RoboFlamingo baseline at iso-accuracy. Our code is available at https://github.com/PKU-SEC-Lab/DYSL_VLA.
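The abstract does not give implementation details, but the core idea of always executing informative layers while gating incremental ones on a per-action importance score can be sketched in PyTorch. The sketch below is an illustrative assumption, not the authors' code: `SkipGate`, the `informative` index set, and the fixed `threshold` are hypothetical placeholders, and the paper's prior-post guidance mechanism is collapsed here into a single learned pre-layer gate for brevity.

```python
# Minimal sketch (assumed, not DySL-VLA's actual implementation) of
# importance-based dynamic layer skipping: "informative" layers always run,
# "incremental" layers are skipped when a gate scores the action unimportant.
import torch
import torch.nn as nn


class SkipGate(nn.Module):
    """Hypothetical gate: scores the hidden state; a low score means skip."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Mean-pool over tokens, then squash to a (0, 1) importance score.
        return torch.sigmoid(self.proj(h.mean(dim=1))).squeeze(-1)


class DynamicSkipStack(nn.Module):
    def __init__(self, dim: int, n_layers: int, informative: set[int],
                 threshold: float = 0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.gates = nn.ModuleList(SkipGate(dim) for _ in range(n_layers))
        self.informative = informative  # indices that are always executed
        self.threshold = threshold

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            if i in self.informative:
                h = layer(h)  # informative layer: always run
            elif self.gates[i](h).mean() >= self.threshold:
                h = layer(h)  # incremental layer, important action: run
            # else: incremental layer skipped entirely, saving its FLOPs
        return h


# Usage: a 6-layer stack where layers 0, 3, and 5 are informative.
stack = DynamicSkipStack(dim=256, n_layers=6, informative={0, 3, 5})
out = stack(torch.randn(2, 10, 256))  # (batch, tokens, dim)
```

A faithful reproduction would additionally need the skip-aware two-stage knowledge distillation described in the abstract, so that the student stays accurate when incremental layers are bypassed; the gate above would be trained as part of that procedure rather than with a hand-set threshold.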