Mar 15, 2026

arXiv:2603.14371

OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism

Xiangyu Li, Huaizhi Tang, Xin Ding, Weijun Wang, Ting Cao, Yunxin Liu

AI Summary

The paper introduces OxyGen, a unified KV cache management system that improves the efficiency of multi-task Vision-Language-Action Models (VLAs) on edge devices. OxyGen treats the KV cache as a resource shared across tasks and over time. This enables cross-task KV sharing, which eliminates redundant prefill of shared observations, and cross-frame continuous batching, which decouples language decoding from action generation. Experiments with the $π_{0.5}$ MoT VLA show up to 3.7x speedup over isolated execution, with over 200 tokens/s language throughput and 70 Hz action frequency achieved simultaneously.

Key Contribution

Up to 3.7x faster multi-task VLA inference on-device, achieved by unifying KV cache management across tasks and over time.

Abstract

Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment due to redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference paradigm that treats KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this paradigm for $π_{0.5}$, the most popular MoT VLA, and evaluate under representative robotic configurations. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without action quality degradation.
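The two optimizations named in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the OxyGen implementation: `ToyEngine`, `KVCache`, `run_isolated`, `run_shared`, and `run_cross_frame_batching` are hypothetical names introduced here, the "cache" is a plain list of token ids rather than real key/value tensors, and the per-request decode budget per control cycle is an assumed parameter.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy stand-in for a per-frame KV cache (hypothetical, holds token ids only)."""
    frame_id: int
    entries: list = field(default_factory=list)


class ToyEngine:
    """Hypothetical engine that only counts prefill calls; not the OxyGen API."""

    def __init__(self):
        self.prefill_calls = 0

    def prefill(self, frame_id, observation_tokens):
        # Stands in for the expensive prefill pass over one observation.
        self.prefill_calls += 1
        return KVCache(frame_id, list(observation_tokens))


def run_isolated(engine, frames, tasks):
    """Isolated execution: every task prefills the same observation itself."""
    for frame_id, obs in enumerate(frames):
        for _ in tasks:
            engine.prefill(frame_id, obs)
    return engine.prefill_calls


def run_shared(engine, frames, tasks):
    """Cross-task KV sharing: one prefill per frame, read by every task."""
    for frame_id, obs in enumerate(frames):
        cache = engine.prefill(frame_id, obs)
        for _ in tasks:
            _ = cache.entries  # each task reads the shared cache
    return engine.prefill_calls


def run_cross_frame_batching(reply_lengths, tokens_per_cycle):
    """Cross-frame continuous batching, simplified.

    Each control cycle admits the language request of the new frame (if any)
    and decodes up to `tokens_per_cycle` tokens for every active request in
    one batch. A long reply keeps decoding in later cycles instead of
    delaying the next cycle. Returns (cycle, batched_frame_ids) per cycle.
    """
    active = {}  # frame_id -> tokens still to decode
    log = []
    cycle = 0
    while cycle < len(reply_lengths) or active:
        if cycle < len(reply_lengths):
            active[cycle] = reply_lengths[cycle]
        batch = sorted(active)
        for frame_id in batch:
            active[frame_id] -= min(tokens_per_cycle, active[frame_id])
            if active[frame_id] == 0:
                del active[frame_id]
        log.append((cycle, batch))
        cycle += 1
    return log


if __name__ == "__main__":
    frames = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    tasks = ["action", "language"]
    print(run_isolated(ToyEngine(), frames, tasks))
    print(run_shared(ToyEngine(), frames, tasks))
    print(run_cross_frame_batching([5, 2, 4], tokens_per_cycle=2))
```

In this toy setting, sharing halves the prefill count for two tasks, and the batching log shows requests from different frames decoding together in the same cycle. For scale, the reported figures of over 200 tokens/s at 70 Hz correspond to a control cycle of about 14 ms and roughly 3 decoded tokens per cycle in aggregate.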

Distributed Systems & Hardware, Inference & Quantization, Robotics & Embodied AI

Year: 2026

Venue: N/A
