AccelOpt, a self-improving LLM agentic system, automates AI accelerator kernel optimization by iteratively generating and evaluating kernels based on past optimization experiences. Evaluated on NKIBench, a new benchmark suite of AWS Trainium kernels, AccelOpt improves average peak throughput from 49% to 61% on Trainium 1 and 45% to 59% on Trainium 2. Remarkably, AccelOpt achieves comparable kernel improvements to Claude Sonnet 4 at 1/26th the cost, demonstrating the potential of open-source LLMs for hardware optimization.
Open-source LLMs can now autonomously optimize AI accelerator kernels, matching the performance of proprietary models at a fraction of the cost.
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26× cheaper. The code is open-sourced at https://github.com/zhang677/AccelOpt.
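To make the abstract's core loop concrete, here is a minimal, hypothetical sketch of a self-improving optimization loop with an optimization memory. All names (`OptimizationMemory`, `generate_candidate`, `optimize`) are illustrative assumptions, not AccelOpt's actual API; the LLM call is stubbed with a random perturbation, and a kernel is abstracted as its measured throughput.

```python
import random


class OptimizationMemory:
    """Curates slow-fast pairs from past iterations (hypothetical sketch,
    not AccelOpt's real data structure)."""

    def __init__(self):
        self.pairs = []  # (slow_throughput, fast_throughput, speedup)

    def add(self, slow, fast):
        # Only genuine improvements become reusable experiences.
        if fast > slow:
            self.pairs.append((slow, fast, fast / slow))

    def insights(self, k=3):
        # Retrieve the k largest-speedup experiences to guide generation.
        return sorted(self.pairs, key=lambda p: -p[2])[:k]


def generate_candidate(current, insights, rng):
    # Stand-in for the LLM generation step: propose a variant whose
    # expected gain grows with the number of retrieved insights.
    boost = 1.0 + 0.1 * len(insights)
    return current + rng.random() * boost


def optimize(baseline_throughput, steps=20, seed=0):
    """Iteratively generate, evaluate, and memorize kernel variants."""
    rng = random.Random(seed)
    memory = OptimizationMemory()
    best = baseline_throughput
    for _ in range(steps):
        candidate = generate_candidate(best, memory.insights(), rng)
        if candidate > best:          # "evaluate" step: keep only improvements
            memory.add(best, candidate)
            best = candidate
    return best
```

The key design point this sketch mirrors is that the memory feeds back into generation: each accepted slow-fast pair enlarges the insight set that conditions the next candidate, which is what makes the system self-improving rather than a stateless search.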