The paper introduces DySL-VLA, a framework that reduces the computational cost of Vision-Language-Action (VLA) models for robot manipulation by dynamically skipping layers according to each action's importance. A prior-post skipping guidance mechanism decides when to skip "incremental" layers, while "informative" layers are always executed. The model is trained with a skip-aware two-stage knowledge distillation algorithm, cutting trainable parameters by 85.7x and delivering a 3.75x speedup while maintaining or improving task success.
Get 3.75x faster VLA inference for robot manipulation without sacrificing accuracy by dynamically skipping layers based on action importance.
Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks such as manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task vary in importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that reduces computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are always executed, and incremental layers, which can be selectively skipped. To skip layers intelligently without sacrificing accuracy, we introduce a prior-post skipping guidance mechanism that determines when to initiate layer skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments show that DySL-VLA achieves a 2.1% improvement in success length over Deer-VLA on the Calvin dataset, while reducing trainable parameters by a factor of 85.7 and providing a 3.75x speedup over the RoboFlamingo baseline at iso-accuracy. Our code is available at https://github.com/PKU-SEC-Lab/DYSL_VLA.
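The abstract does not give implementation details, but the core idea of always executing informative layers while gating incremental ones on a per-action importance score can be sketched in PyTorch. The sketch below is an illustrative assumption, not the authors' code: `SkipGate`, the `informative` index set, and the fixed `threshold` are hypothetical placeholders, and the paper's prior-post guidance mechanism is collapsed here into a single learned pre-layer gate for brevity.

```python
# Minimal sketch (assumed, not DySL-VLA's actual implementation) of
# importance-based dynamic layer skipping: "informative" layers always run,
# "incremental" layers are skipped when a gate scores the action unimportant.
import torch
import torch.nn as nn


class SkipGate(nn.Module):
    """Hypothetical gate: scores the hidden state; a low score means skip."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Mean-pool over tokens, then squash to a (0, 1) importance score.
        return torch.sigmoid(self.proj(h.mean(dim=1))).squeeze(-1)


class DynamicSkipStack(nn.Module):
    def __init__(self, dim: int, n_layers: int, informative: set[int],
                 threshold: float = 0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.gates = nn.ModuleList(SkipGate(dim) for _ in range(n_layers))
        self.informative = informative  # indices that are always executed
        self.threshold = threshold

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            if i in self.informative:
                h = layer(h)  # informative layer: always run
            elif self.gates[i](h).mean() >= self.threshold:
                h = layer(h)  # incremental layer, important action: run
            # else: incremental layer skipped entirely, saving its FLOPs
        return h


# Usage: a 6-layer stack where layers 0, 3, and 5 are informative.
stack = DynamicSkipStack(dim=256, n_layers=6, informative={0, 3, 5})
out = stack(torch.randn(2, 10, 256))  # (batch, tokens, dim)
```

A faithful reproduction would additionally need the skip-aware two-stage knowledge distillation described in the abstract, so that the student stays accurate when incremental layers are bypassed; the gate above would be trained as part of that procedure rather than with a hand-set threshold.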