This paper introduces a compositional framework for generating power traces of LLM inference workloads, addressing the limitations of existing datacenter power models that fail to capture the dynamic power consumption patterns of GPUs during prefill, decode, and idle states. The framework learns from measured traces and synthesizes power profiles for new traffic conditions and serving configurations by modeling power as a function of workload-driven state transitions and configuration-specific power distributions. Experiments across various LLMs, tensor-parallel settings, and GPU generations demonstrate that the framework achieves median absolute energy error below 5% while preserving temporal autocorrelation, enabling accurate infrastructure evaluations.
Accurately simulate LLM inference power consumption at scale – from individual GPUs to entire datacenters – with a framework that learns from real-world traces and generalizes to unseen configurations.
Datacenter operators and electrical utilities rely on power traces at different spatiotemporal scales. Operators use fine-grained traces for provisioning, facility management, and scheduling, while utilities use site-level load profiles for capacity and interconnection planning. Existing datacenter power models do not capture LLM inference workloads, in which GPUs shift rapidly among compute-intensive prefill, lower-power decode, and idle states, and facility demand depends on how these states evolve and synchronize across many devices. We show that LLM inference power can be represented compositionally through two components: workload-driven transitions among operating states and configuration-specific power distributions within those states. Building on this observation, we develop a trace-generation framework that learns from measured traces and synthesizes power profiles for new traffic conditions and serving configurations. These traces aggregate from GPU servers to rack-, row-, and facility-scale load profiles at the temporal granularity required by the study. Across multiple LLMs, tensor-parallel settings, and GPU generations, our framework achieves median absolute energy error below 5% for most configurations while preserving temporal autocorrelation structure. The resulting traces support downstream analyses including oversubscription, power modulation, and utility-facing load characterization, enabling infrastructure evaluations that flat nameplate assumptions and static trace replay cannot support.
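The two-component decomposition described above, workload-driven transitions among operating states plus configuration-specific power distributions within those states, can be sketched as a simple Markov-style trace generator. This is a minimal illustration of the idea, not the paper's method: the state names follow the abstract (prefill, decode, idle), but the transition probabilities, wattage ranges, and uniform-distribution assumption are all illustrative placeholders, not measured or fitted values.

```python
import random

STATES = ["prefill", "decode", "idle"]

# Workload-driven transition probabilities between operating states
# (illustrative values only; in the paper these would be learned from
# measured traces under a given traffic condition).
TRANSITIONS = {
    "prefill": {"prefill": 0.3, "decode": 0.6, "idle": 0.1},
    "decode":  {"prefill": 0.1, "decode": 0.8, "idle": 0.1},
    "idle":    {"prefill": 0.4, "decode": 0.1, "idle": 0.5},
}

# Configuration-specific power distributions in watts (e.g. one GPU model
# and tensor-parallel setting), modeled here as uniform ranges for brevity.
POWER_W = {
    "prefill": (550.0, 700.0),  # compute-intensive phase
    "decode":  (250.0, 400.0),  # lower-power, memory-bound phase
    "idle":    (60.0, 90.0),
}

def synthesize_trace(n_samples: int, seed: int = 0) -> list[float]:
    """Sample one GPU's power trace by walking the state machine."""
    rng = random.Random(seed)
    state = "idle"
    trace = []
    for _ in range(n_samples):
        lo, hi = POWER_W[state]
        trace.append(rng.uniform(lo, hi))
        # Draw the next state from this state's transition row.
        row = TRANSITIONS[state]
        state = rng.choices(list(row), weights=list(row.values()))[0]
    return trace

def aggregate(traces: list[list[float]]) -> list[float]:
    """Sum per-GPU traces into a server/rack/facility load profile."""
    return [sum(sample) for sample in zip(*traces)]

# Eight per-GPU traces aggregated into a server-level profile.
gpu_traces = [synthesize_trace(100, seed=i) for i in range(8)]
server_profile = aggregate(gpu_traces)
```

Because states persist across consecutive samples, the synthesized trace retains short-range temporal autocorrelation, and per-device traces sum naturally to rack-, row-, and facility-scale profiles, mirroring the aggregation the abstract describes.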