NJU | PKU | School of Computer Science | China | Mar 2, 2026 | arXiv:2603.01581

KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

Zihao Zheng, Zhihao Mao, Z. Mao, Maoliang Li, Jiayu Chen, Zhaobo Zhang, Donggang Cao, Hong Mei

AI Summary

The paper introduces KERV, a novel speculative decoding framework for Vision-Language-Action (VLA) models that leverages kinematic-domain prediction to accelerate robot control. KERV uses a kinematics-based Kalman Filter to predict actions and rectify errors in speculative decoding, thereby avoiding expensive re-inference. By dynamically adjusting the acceptance threshold based on kinematic principles, KERV achieves a 27-37% speedup with minimal impact on task success rate across various tasks and environments.

Key Contribution

By integrating kinematic prediction with speculative decoding, KERV enables VLA models to achieve a 27-37% speedup in robot control tasks with almost no loss in success rate.

Abstract

Vision-Language-Action (VLA) models build a token-domain robot control paradigm, yet suffer from low speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed. Two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to address token errors, which is computationally expensive; second, to mitigate token errors, the acceptance threshold in SD requires careful adjustment. Existing works fail to address the above two issues effectively. Meanwhile, as the bridge between AI and the physical world, existing embodied intelligence has overlooked the application of robotic kinematics. To address these issues, we innovatively combine token-domain VLA models with kinematic-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, addressing the difficulty of threshold determination. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27%~37% acceleration with nearly no Success Rate loss.
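The compensation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a scalar action, a constant-velocity motion model, and a simple rule in which a draft action is accepted when it lies within `threshold` of the verifier's action, and every action from the first rejection onward is filled in by the filter's prediction instead of re-running the model. The names `ConstantVelocityKalman1D` and `rectify_chunk`, the noise values `q` and `r`, and the fixed threshold are all hypothetical; the paper's dynamic threshold adjustment is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class ConstantVelocityKalman1D:
    """Scalar constant-velocity Kalman filter (illustrative stand-in only)."""
    pos: float = 0.0
    vel: float = 0.0
    p00: float = 1.0   # covariance entries of the symmetric 2x2 matrix P
    p01: float = 0.0
    p11: float = 1.0
    q: float = 1e-3    # process noise (assumed value)
    r: float = 1e-2    # measurement noise (assumed value)
    dt: float = 1.0    # one control step

    def predict(self) -> float:
        # State: x = F x with F = [[1, dt], [0, 1]]
        self.pos += self.dt * self.vel
        # Covariance: P = F P F^T + q * I
        p00 = self.p00 + self.dt * (2.0 * self.p01 + self.dt * self.p11) + self.q
        p01 = self.p01 + self.dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p11 = p00, p01, p11
        return self.pos

    def update(self, z: float) -> None:
        # Measurement model H = [1, 0]: only the position (action) is observed
        s = self.p00 + self.r
        k0, k1 = self.p00 / s, self.p01 / s
        y = z - self.pos
        self.pos += k0 * y
        self.vel += k1 * y
        # Covariance: P = (I - K H) P
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11


def rectify_chunk(draft, verified, kf, threshold):
    """Accept draft actions close to the verifier; after the first rejection,
    substitute filter predictions rather than re-running inference."""
    out = []
    rejected = False
    for d, v in zip(draft, verified):
        pred = kf.predict()
        if not rejected and abs(d - v) <= threshold:
            out.append(d)
            kf.update(d)       # an accepted action acts as a measurement
        else:
            rejected = True
            out.append(pred)   # no measurement available: the filter coasts
    return out


# Warm the filter up on a linear trajectory, then rectify a chunk whose
# second draft action (9.0) disagrees with the verifier (5.0).
kf = ConstantVelocityKalman1D()
for z in [0.0, 1.0, 2.0, 3.0]:
    kf.predict()
    kf.update(z)
actions = rectify_chunk([4.0, 9.0, 6.0], [4.0, 5.0, 6.0], kf, threshold=0.5)
```

In this toy run the first draft action is accepted, and the remaining two are replaced by predictions that continue the learned trend. A real VLA action is multi-dimensional, so one such filter per action dimension (or a vector-valued filter) would be needed.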

Inference & Quantization | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 34

Year: 2026

Venue: N/A
