This paper presents a large-scale empirical study of inference time and energy consumption across 46 generative AI models, 7 tasks, and 1,858 configurations on NVIDIA H100 and B200 GPUs, revealing order-of-magnitude variations based on task type, modality, and GPU utilization. The authors propose a diagnostic framework that decomposes energy consumption into latent metrics like memory usage and GPU utilization, which are influenced by factors across the algorithm, software, and hardware stack. The framework facilitates reasoning about energy consumption and extends to throughput-per-watt analysis.
LLM task choice can swing inference energy by 25x, and video generation consumes over 100x more energy than images, revealing massive optimization potential in generative AI.
Energy is now a critical computing resource for ML. While measuring energy consumption and observing trends is a valuable first step, accurately understanding and diagnosing why those differences occur is crucial for optimization. To that end, we begin by presenting a large-scale measurement study of inference time and energy across the generative AI landscape with 46 models, 7 tasks, and 1,858 different configurations on NVIDIA H100 and B200 GPUs. Our empirical findings span order-of-magnitude variations: LLM task type can lead to 25$\times$ energy differences, video generation sometimes consumes more than 100$\times$ the energy of images, and GPU utilization differences can result in 3--5$\times$ energy differences. Based on our observations, we present a framework for reasoning about the underlying mechanisms that govern time and energy consumption. The essence is that time and energy are determined by latent metrics like memory and utilization, which are in turn affected by various factors across the algorithm, software, and hardware layers. Our framework also extends directly to throughput per watt, a critical metric for power-constrained datacenters.
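The core decomposition described above (energy determined by latent metrics such as power draw and time, extending to throughput per watt) can be sketched with a minimal toy model. This is an illustrative sketch, not the paper's code: the class, field names, and all numbers below are hypothetical, chosen only to show how per-request energy and throughput per watt fall out of measured latency and average power.

```python
# Illustrative sketch (hypothetical numbers, not the paper's measurements):
# energy as the product of average power and time, with throughput per watt
# derived from the same quantities.
from dataclasses import dataclass

@dataclass
class InferenceRun:
    latency_s: float    # wall-clock time for the batch of requests
    avg_power_w: float  # mean GPU power draw during the run
    requests: int       # requests completed in the run

    @property
    def energy_j(self) -> float:
        # Energy (joules) = average power (watts) x time (seconds).
        return self.avg_power_w * self.latency_s

    @property
    def energy_per_request_j(self) -> float:
        return self.energy_j / self.requests

    @property
    def throughput_per_watt(self) -> float:
        # Requests per second per watt: the datacenter-facing metric.
        return (self.requests / self.latency_s) / self.avg_power_w

# Two hypothetical LLM tasks: a short-output task vs. a long-generation task.
# Longer generation dominates energy even though power draw is similar.
short = InferenceRun(latency_s=2.0, avg_power_w=400.0, requests=16)
long = InferenceRun(latency_s=40.0, avg_power_w=500.0, requests=16)
print(long.energy_per_request_j / short.energy_per_request_j)  # 25.0
```

Note that the task-level energy gap here comes almost entirely from time, not power; this is why latent metrics like utilization matter, since they shift average power within a much narrower band than runtime does.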