The paper introduces HeShare, a framework for energy-aware, efficient multi-task GPU sharing in heterogeneous GPU systems that addresses the challenge of balancing energy efficiency and performance in datacenters. HeShare combines an energy-aware task scheduling strategy, which optimizes task allocation across heterogeneous GPUs, with a GPU sharing optimization mechanism that adaptively configures Multi-Process Service (MPS) and dynamic voltage and frequency scaling (DVFS) settings for each GPU. Experimental results show a 26% reduction in average energy costs and a 31% improvement in job completion time compared to the state-of-the-art framework.
Cut average energy costs by 26% and improve job completion time by 31% in heterogeneous GPU clusters with HeShare's energy-aware task scheduling and adaptive MPS and DVFS configuration.
With the rapid growth of artificial intelligence and large-scale model computing, the demand for GPUs in datacenters continues to increase, especially for large-scale training and inference tasks. Heterogeneous multi-GPU systems, which integrate GPUs with varying types and computational capabilities, have become critical computing resources. This leads to two main challenges. First, due to the differences in GPU performance and power consumption, task scheduling involves a complex multi-objective optimization to balance energy efficiency and performance. More importantly, the lack of coordinated mechanisms for multi-task sharing and energy-efficient resource management across heterogeneous GPUs can result in GPU overload or underutilization, leading to wasted resources and potential system risks. To address these challenges, we propose HeShare, an energy-aware and efficient heterogeneous GPU framework for datacenters. First, we design an energy-aware task scheduling strategy that optimizes task allocation across different GPUs to achieve a balance between energy consumption and performance. Second, we introduce a GPU sharing optimization mechanism that adaptively configures MPS and DVFS settings for each GPU, enhancing resource utilization, reducing overall energy consumption, and ensuring task performance. Compared to the state-of-the-art framework, we reduce average energy costs by 26% and improve job completion time by 31%, achieving a balance between energy efficiency and performance.
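The abstract does not spell out the scheduling algorithm, so the following is only a minimal sketch of what an energy-aware placement trade-off could look like, not HeShare's actual method. It assumes a greedy policy that scores each GPU by a weighted sum of normalized energy and normalized completion time. The `Gpu` fields, the `assign` function, and the weight `alpha` are all hypothetical names introduced for illustration, and the sketch does not model MPS or DVFS configuration.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    # All fields are hypothetical, illustrative quantities.
    name: str
    throughput: float        # work units processed per second
    power_watts: float       # average power draw under load
    busy_until: float = 0.0  # time at which the GPU becomes free


def assign(tasks, gpus, alpha=0.5):
    """Greedily place each (task_name, work) pair on one GPU.

    The score is alpha * normalized energy + (1 - alpha) * normalized
    finish time, so alpha=1 favors energy and alpha=0 favors speed.
    This is an assumed heuristic, not the strategy from the paper.
    """
    placement = {}
    for task_name, work in tasks:
        candidates = []
        for gpu in gpus:
            runtime = work / gpu.throughput
            energy = runtime * gpu.power_watts    # joules
            finish = gpu.busy_until + runtime     # queued behind earlier tasks
            candidates.append((gpu, energy, finish))
        # Normalize by the worst candidate so both terms lie in (0, 1].
        max_energy = max(c[1] for c in candidates)
        max_finish = max(c[2] for c in candidates)
        gpu, energy, finish = min(
            candidates,
            key=lambda c: alpha * c[1] / max_energy
            + (1 - alpha) * c[2] / max_finish,
        )
        gpu.busy_until = finish
        placement[task_name] = gpu.name
    return placement
```

With one fast, power-hungry GPU and one slower, lower-power GPU, setting `alpha` close to 1 steers a task toward the lower-energy device, while setting it close to 0 steers it toward the device that finishes sooner. A real scheduler would also need measured power and performance profiles per task and per GPU type, which this sketch simply takes as given.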