This paper introduces EcoShift, a cluster-wide power management framework for heterogeneous CPU-GPU systems operating under power constraints. It uses online performance prediction to model each application's sensitivity to CPU and GPU power caps, then uses dynamic programming to allocate reclaimed power so as to maximize average performance improvement. In emulation on Intel-NVIDIA platforms, EcoShift improves average performance by up to 6% compared to fair-share and utilization-based policies while respecting the cluster power limit.
Squeezing more performance from power-constrained CPU-GPU clusters is now possible: EcoShift dynamically allocates reclaimed power based on application-specific sensitivity, boosting performance by up to 6%.
Power-constrained HPC systems increasingly run heterogeneous CPU-GPU applications under strict cluster-wide power limits. Existing cluster-wide power management policies rely on fair-share or utilization heuristics and do not capture application-specific sensitivity to CPU and GPU power caps, leading to inefficient use of reclaimed power. We present EcoShift, a performance-aware cluster-wide power management framework. EcoShift combines online performance prediction with a dynamic-programming-based allocator to distribute reclaimed power across CPU-GPU applications for maximum average performance improvement. Through emulation-based evaluation on two heterogeneous Intel CPU and NVIDIA A100/H100 GPU platforms with diverse CPU-GPU workloads, EcoShift consistently outperforms state-of-the-art policies, achieving up to 6% average performance improvement while preserving the cluster-wide power constraint.
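The abstract does not spell out the allocator, so the following is only an illustrative sketch of how a dynamic-programming allocation of reclaimed power could look, not EcoShift's actual algorithm. It assumes the reclaimed power is discretized into equal units and that a predictor has already produced, for each application, the expected improvement at each number of extra units. The names allocate_reclaimed_power, gains, and budget are hypothetical, and the numbers in the usage example are made up for illustration.

```python
from typing import List, Tuple


def allocate_reclaimed_power(
    gains: List[List[float]], budget: int
) -> Tuple[float, List[int]]:
    """Split `budget` power units across applications to maximize total gain.

    gains[i][w] is the (assumed, predicted) improvement of application i
    when it receives w extra power units; gains[i][0] should be 0.
    Maximizing the total is equivalent to maximizing the average.
    """
    n = len(gains)
    # best[i][b]: best total gain using the first i apps and at most b units.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    # choice[i][b]: units given to app i-1 in that best solution.
    choice = [[0] * (budget + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        options = gains[i - 1]
        for b in range(budget + 1):
            best_val, best_w = float("-inf"), 0
            # Try every feasible number of units for this application.
            for w in range(min(b, len(options) - 1) + 1):
                val = best[i - 1][b - w] + options[w]
                if val > best_val:
                    best_val, best_w = val, w
            best[i][b] = best_val
            choice[i][b] = best_w

    # Walk back through the table to recover the per-application allocation.
    alloc = [0] * n
    b = budget
    for i in range(n, 0, -1):
        alloc[i - 1] = choice[i][b]
        b -= alloc[i - 1]
    return best[n][budget], alloc


if __name__ == "__main__":
    # Made-up sensitivity curves for three applications (not from the paper).
    example_gains = [
        [0.0, 5.0, 6.0, 6.5],  # saturates quickly
        [0.0, 1.0, 4.0, 8.0],  # benefits most from a large allocation
        [0.0, 2.0, 3.0, 3.5],
    ]
    total, alloc = allocate_reclaimed_power(example_gains, budget=4)
    print(total, alloc)  # 13.0 [1, 3, 0]
```

In this toy case the sketch gives three of the four units to the application whose predicted gain grows fastest and none to the least sensitive one, which is the kind of sensitivity-aware decision that a fair-share split would not make. The total allocation never exceeds the budget, mirroring the cluster-wide power constraint.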