The paper introduces KIS-S, a framework combining a GPU-aware Kubernetes Inference Simulator (KISim) with a Proximal Policy Optimization (PPO)-based autoscaler (KIScaler) to address the limitations of reactive autoscaling in Kubernetes. KISim provides high-fidelity scheduling emulation with real GPU hardware and Prometheus integration, enabling KIScaler to learn latency-aware and resource-efficient scaling policies in simulation. Experiments across various traffic patterns demonstrate that KIScaler improves moving average reward by 75.2% and reduces P95 latency by up to 6.7x compared to CPU-only baselines, showcasing the effectiveness of simulation-based reinforcement learning for autoscaling GPU inference workloads.
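In the paper, KIScaler's policy is a PPO-trained neural network that maps observed metrics to replica adjustments. As a rough illustration of that observe-decide-act loop only (not the paper's actual policy, and with all thresholds hypothetical), a rule-based stand-in might look like:

```python
def scale_decision(p95_latency_ms: float, gpu_util: float, replicas: int,
                   latency_slo_ms: float = 200.0,
                   min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Toy latency-aware scaling rule standing in for the learned policy.

    The SLO, utilization threshold, and replica bounds are illustrative
    assumptions, not values from the paper.
    """
    # Scale up when the latency objective is violated and headroom exists.
    if p95_latency_ms > latency_slo_ms and replicas < max_replicas:
        return replicas + 1
    # Scale down when latency is comfortably low and GPUs sit mostly idle.
    if (p95_latency_ms < 0.5 * latency_slo_ms and gpu_util < 0.3
            and replicas > min_replicas):
        return replicas - 1
    # Otherwise hold steady.
    return replicas
```

In the real system this decision would be driven by Prometheus-scraped metrics and applied through the Kubernetes API; the learned PPO policy replaces the hand-written thresholds with behavior optimized against the simulator's reward signal.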
Ditch reactive autoscaling: a new RL-powered Kubernetes autoscaler learns to anticipate traffic spikes and optimize GPU inference deployments entirely in simulation.
Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive and threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggle under dynamic and bursty traffic patterns and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler. KISim enables safe, high-fidelity scheduling emulation with real GPU hardware and Prometheus integration, while KIScaler learns latency-aware and resource-efficient scaling policies entirely in simulation. KIScaler observes system metrics via Prometheus and adjusts replica counts via the Kubernetes API. We evaluate KIS-S across four synthetic traffic patterns (ramp, periodic, random, and spike) and compare it against conventional baselines including HPA and fixed-resource deployments. Despite training with synthetic feedback due to single-GPU hardware constraints, KIScaler's moving average reward improves from 1.05 to 1.84 (a 75.2% increase) over 100 training episodes, reduces P95 latency by up to 6.7× over CPU-only baselines, and generalizes across all traffic patterns without retraining. These results highlight the value of combining simulation and learning, bridging the gap between reactive autoscaling and intelligent orchestration for scalable, GPU-accelerated Kubernetes environments.
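The abstract names four synthetic traffic patterns used for evaluation: ramp, periodic, random, and spike. The paper does not give generator formulas, so the sketch below is one plausible way to produce such request-rate curves (all rates, periods, and spike timings are invented for illustration):

```python
import math
import random

def ramp(t: float, duration: float = 100.0, peak: float = 50.0) -> float:
    """Request rate rising linearly to a peak, then holding (hypothetical shape)."""
    return peak * min(t / duration, 1.0)

def periodic(t: float, period: float = 20.0,
             base: float = 25.0, amplitude: float = 20.0) -> float:
    """Sinusoidal load oscillating around a base rate."""
    return base + amplitude * math.sin(2 * math.pi * t / period)

def random_load(t: float, rng: random.Random,
                base: float = 25.0, jitter: float = 15.0) -> float:
    """Uniform random fluctuation around a base rate (caller supplies the RNG)."""
    return base + rng.uniform(-jitter, jitter)

def spike(t: float, base: float = 5.0, spike_at: float = 50.0,
          spike_len: float = 5.0, spike_rate: float = 60.0) -> float:
    """Mostly flat load with one short, sharp burst."""
    return spike_rate if spike_at <= t < spike_at + spike_len else base
```

Bursty shapes like `spike` are exactly where threshold-based autoscalers such as HPA react too late, which motivates training a policy against these traces in simulation.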