The paper introduces DyQ-VLA, a dynamic quantization framework for Vision-Language-Action (VLA) models in embodied AI. It addresses a limitation of static quantization: a fixed precision ignores how a VLA model's sensitivity to quantization error varies over the course of a task. DyQ-VLA uses a sensitivity-aware switching strategy, driven by real-time kinematic proxies, to trigger bit-width adjustments, and a kinematic-guided module to allocate bit-widths dynamically. In the reported experiments, DyQ-VLA uses 30.9% of the original memory footprint while retaining 99.5% of the original performance, with speedups of 1.49x in simulation and up to 1.43x in real-world settings.
Squeeze your embodied AI models: DyQ-VLA cuts memory footprint by about 69% and speeds up inference by up to 1.49x while retaining 99.5% of original performance, all by dynamically adjusting bit-widths based on real-time kinematic data.
Vision-Language-Action (VLA) models are dominant in embodied intelligence but are constrained by inference overheads. While model quantization alleviates these bottlenecks for edge deployment, static quantization approaches remain suboptimal for VLAs due to two critical challenges: (1) Temporal-dynamic sensitivity, where fixed precision wastes resources by ignoring stage-varying error tolerances; and (2) Real-time allocation, where identifying real-time sensitivity to guide bit allocation remains unsolved. To address these challenges, we propose DyQ-VLA, a dynamic quantization framework for VLAs. Specifically, a sensitivity-aware switching strategy leverages real-time kinematic proxies to trigger the bit-width switch, while a kinematic-guided module dynamically allocates the optimal bit-width. Experiments show that DyQ-VLA requires only 30.9% of the original memory footprint while maintaining 99.5% of its original performance, achieving 1.49x simulation and up to 1.43x real-world speedups.
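To make the switching idea concrete, here is a minimal Python sketch of a bit-width selector driven by a kinematic proxy. It is an illustration under stated assumptions, not the DyQ-VLA implementation: the abstract does not say which proxies, thresholds, or bit-widths are used. The sketch assumes the proxy is end-effector speed, that slow motion marks a precision-sensitive stage, and that the two precisions are 4 and 8 bits. The class name `KinematicSwitch`, its fields, and all numeric values are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class KinematicSwitch:
    """Hypothetical bit-width selector driven by end-effector speed.

    Assumption (not from the paper): slow motion indicates a fine,
    error-sensitive stage that warrants the higher precision.
    """
    low_bits: int = 4          # assumed precision for coarse motion
    high_bits: int = 8         # assumed precision for fine motion
    enter_fine: float = 0.02   # m/s; below this, switch to high_bits
    exit_fine: float = 0.05    # m/s; above this, return to low_bits
    bits: int = field(init=False)

    def __post_init__(self):
        # Start in the cheap, low-precision mode.
        self.bits = self.low_bits

    def update(self, prev_pos, pos, dt):
        """Return the bit-width for the next inference step."""
        speed = math.dist(prev_pos, pos) / dt  # kinematic proxy
        # Two thresholds give hysteresis, so the selector does not
        # flip back and forth when speed hovers near one boundary.
        if self.bits == self.low_bits and speed < self.enter_fine:
            self.bits = self.high_bits
        elif self.bits == self.high_bits and speed > self.exit_fine:
            self.bits = self.low_bits
        return self.bits


# Usage: feed consecutive end-effector positions (meters) at 10 Hz.
switch = KinematicSwitch()
trajectory = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.101, 0.0, 0.0), (0.2, 0.0, 0.0)]
for prev, cur in zip(trajectory, trajectory[1:]):
    print(switch.update(prev, cur, dt=0.1))
```

The two-threshold design is a choice made for this sketch; a real system would also need to account for the cost of switching precision, which the abstract does not quantify.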