This paper introduces AILFM, an Active Imitation Learning (AIL) framework that addresses thermal management and performance optimization in 3D S-NUCA many-core systems running Large Foundation Model (LFM) inference. AILFM learns near-optimal thermal-aware scheduling policies by imitating Oracle demonstrations, managing thread migration and voltage/frequency (V/f) scaling based on core-level performance heterogeneity and kernel-specific behavior. Experiments show that AILFM outperforms state-of-the-art baselines and generalizes well across diverse LFM workloads while balancing thermal safety and performance.
Ditch the GPU? Active Imitation Learning can tame the thermal chaos of running large foundation models on 3D-stacked CPUs, unlocking a cost-effective alternative for LFM inference.
Large Foundation Model (LFM) inference is both memory- and compute-intensive and has traditionally relied on GPUs. However, the limited availability and high cost of GPUs have motivated the adoption of high-performance general-purpose CPUs, especially emerging 3D-stacked Static Non-Uniform Cache Architecture (3D S-NUCA) systems. These architectures offer enhanced bandwidth and locality but suffer from severe thermal challenges and uneven cache latencies due to 3D Networks-on-Chip (NoC). Managing thread migration and voltage/frequency (V/f) scaling optimally is non-trivial because of LFM kernel diversity and system heterogeneity. Existing thermal management approaches often rely on oversimplified analytical models and lack adaptability. We propose AILFM, an Active Imitation Learning (AIL)-based scheduling framework that learns near-optimal thermal-aware scheduling policies from Oracle demonstrations with minimal run-time overhead. AILFM accounts for both core-level performance heterogeneity and kernel-specific behavior in LFMs to maintain thermal safety while maximizing performance. Extensive experiments show that AILFM outperforms state-of-the-art baselines and generalizes well across diverse LFM workloads.
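To make the idea of learning a scheduling policy from Oracle demonstrations more concrete, here is a minimal toy sketch of an active imitation loop. It is not AILFM's implementation: the thermal model, the `HEAT` table, the `T_MAX` limit, the three V/f levels, and the tabular learner are all hypothetical stand-ins chosen for illustration, and thread migration is left out entirely. The learner rolls out its own policy and queries the oracle only for states it has not seen before.

```python
import random

T_MAX = 8                 # hypothetical thermal limit (discretized temperature bin)
FREQ_LEVELS = (0, 1, 2)   # hypothetical V/f levels, low to high
# Hypothetical temperature change per step for each (kernel type, V/f level).
HEAT = {
    "compute": {0: -1, 1: 0, 2: 2},
    "memory": {0: -1, 1: 0, 2: 1},
}

def step(temp, kernel, freq):
    """Toy thermal model: next temperature bin, clamped to [0, 9]."""
    return max(0, min(9, temp + HEAT[kernel][freq]))

def oracle(state):
    """Expert: highest V/f level whose next temperature stays within T_MAX."""
    temp, kernel = state
    safe = [f for f in FREQ_LEVELS if step(temp, kernel, f) <= T_MAX]
    return max(safe) if safe else min(FREQ_LEVELS)

def train(rounds=20, horizon=30, seed=0):
    """Roll out the learner's own policy, querying the oracle on novel states."""
    rng = random.Random(seed)
    policy = {}    # aggregated demonstrations: state -> oracle action
    queries = 0
    for _ in range(rounds):
        temp = rng.randint(0, T_MAX)
        for _ in range(horizon):
            kernel = rng.choice(("compute", "memory"))
            state = (temp, kernel)
            if state not in policy:   # active step: ask only when unsure
                policy[state] = oracle(state)
                queries += 1
            temp = step(temp, kernel, policy[state])
    return policy, queries

if __name__ == "__main__":
    policy, queries = train()
    print(f"oracle queries: {queries}, states learned: {len(policy)}")
```

In this toy setting the oracle is consulted at most once per distinct state, so the number of expensive queries is bounded by the size of the visited state space rather than by the length of the rollouts. A real system would replace the lookup table with a learned model and a confidence measure for deciding when to query.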