The paper introduces KIS-S, a framework combining a GPU-aware Kubernetes Inference Simulator (KISim) with a Proximal Policy Optimization (PPO)-based autoscaler (KIScaler) to address the limitations of reactive autoscaling in Kubernetes. KISim provides high-fidelity scheduling emulation with real GPU hardware and Prometheus integration, enabling KIScaler to learn latency-aware and resource-efficient scaling policies in simulation. Experiments across various traffic patterns demonstrate that KIScaler improves moving average reward by 75.2% and reduces P95 latency by up to 6.7x compared to CPU-only baselines, showcasing the effectiveness of simulation-based reinforcement learning for autoscaling GPU inference workloads.
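In the paper, KIScaler's policy is a PPO-trained neural network that maps observed metrics to replica adjustments. As a rough illustration of that observe-decide-act loop only (not the paper's actual policy, and with all thresholds hypothetical), a rule-based stand-in might look like:

```python
def scale_decision(p95_latency_ms: float, gpu_util: float, replicas: int,
                   latency_slo_ms: float = 200.0,
                   min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Toy latency-aware scaling rule standing in for the learned policy.

    The SLO, utilization threshold, and replica bounds are illustrative
    assumptions, not values from the paper.
    """
    # Scale up when the latency objective is violated and headroom exists.
    if p95_latency_ms > latency_slo_ms and replicas < max_replicas:
        return replicas + 1
    # Scale down when latency is comfortably low and GPUs sit mostly idle.
    if (p95_latency_ms < 0.5 * latency_slo_ms and gpu_util < 0.3
            and replicas > min_replicas):
        return replicas - 1
    # Otherwise hold steady.
    return replicas
```

In the real system this decision would be driven by Prometheus-scraped metrics and applied through the Kubernetes API; the learned PPO policy replaces the hand-written thresholds with behavior optimized against the simulator's reward signal.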
Ditch reactive autoscaling: a new RL-powered Kubernetes autoscaler learns to anticipate traffic spikes and optimize GPU inference deployments entirely in simulation.
Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive and threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggle under dynamic and bursty traffic patterns and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler. KISim enables safe, high-fidelity scheduling emulation with real GPU hardware and Prometheus integration, while KIScaler learns latency-aware and resource-efficient scaling policies entirely in simulation. KIScaler observes system metrics via Prometheus and adjusts replica counts via the Kubernetes API. We evaluate KIS-S across four synthetic traffic patterns (ramp, periodic, random, and spike) and compare it against conventional baselines including HPA and fixed-resource deployments. Despite training with synthetic feedback due to single-GPU hardware constraints, KIScaler's moving average reward improves from 1.05 to 1.84 (a 75.2% increase) over 100 training episodes, reduces P95 latency by up to 6.7× over CPU-only baselines, and generalizes across all traffic patterns without retraining. These results highlight the value of combining simulation and learning, bridging the gap between reactive autoscaling and intelligent orchestration for scalable, GPU-accelerated Kubernetes environments.
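The abstract names four synthetic traffic patterns used for evaluation: ramp, periodic, random, and spike. The paper does not give generator formulas, so the sketch below is one plausible way to produce such request-rate curves (all rates, periods, and spike timings are invented for illustration):

```python
import math
import random

def ramp(t: float, duration: float = 100.0, peak: float = 50.0) -> float:
    """Request rate rising linearly to a peak, then holding (hypothetical shape)."""
    return peak * min(t / duration, 1.0)

def periodic(t: float, period: float = 20.0,
             base: float = 25.0, amplitude: float = 20.0) -> float:
    """Sinusoidal load oscillating around a base rate."""
    return base + amplitude * math.sin(2 * math.pi * t / period)

def random_load(t: float, rng: random.Random,
                base: float = 25.0, jitter: float = 15.0) -> float:
    """Uniform random fluctuation around a base rate (caller supplies the RNG)."""
    return base + rng.uniform(-jitter, jitter)

def spike(t: float, base: float = 5.0, spike_at: float = 50.0,
          spike_len: float = 5.0, spike_rate: float = 60.0) -> float:
    """Mostly flat load with one short, sharp burst."""
    return spike_rate if spike_at <= t < spike_at + spike_len else base
```

Bursty shapes like `spike` are exactly where threshold-based autoscalers such as HPA react too late, which motivates training a policy against these traces in simulation.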