This paper introduces Hierarchical Transition-Attended Memory (HTAM), a coarse-to-fine framework that leverages LLMs to optimize GPU kernels by organizing optimization experience at multiple granularities. HTAM constructs a two-level Hierarchical Transition Graph (HTG) to represent global optimization directions and detailed local strategies, along with transition experiences between optimization steps. Experiments on KernelBench demonstrate that HTAM improves correctness, fast-solution rate, and speedup compared to LLM-based baselines, with transferable benefits shown in backend and Robust-KBench studies.
HTAM lets LLMs optimize GPU kernels more effectively by drawing on a structured memory of optimization strategies organized at two levels of abstraction: coarse global directions and detailed local strategies.
High-performance GPU kernels are essential for efficient LLM deployment, yet optimizing them remains expertise-intensive. Recent LLM-based code generation makes automatic GPU operator generation promising, but operator optimization remains a hardware-aware search problem. Existing LLM-based methods face a granularity mismatch: coarse hints are reusable but hard to execute, whereas detailed memories are actionable but enlarge the search space and obscure optimization bottlenecks. The key challenge is therefore to organize optimization experience at an appropriate granularity. To address this issue, this paper proposes HTAM (Hierarchical Transition-Attended Memory), a coarse-to-fine framework for LLM-based operator optimization. HTAM builds a two-level Hierarchical Transition Graph (HTG) to organize coarse global directions, detailed local strategies, and transition experience between optimization steps. During each evolution step, HTAM selects a global direction from the current state and recent optimization history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. Experiments on the full KernelBench suite demonstrate that HTAM consistently improves correctness, fast-solution rate, and speedup over LLM-based baselines, while backend and Robust-KBench studies indicate transferable benefits from structured memory.
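The evolution step described above (select a global direction, retrieve its local strategy memory, then guide code generation) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: the class `HierarchicalTransitionGraph`, its methods, the helper `build_prompt`, and in particular the selection rule, which simply picks the direction with the highest mean recorded gain after the previous step. The abstract does not say how HTAM scores directions, so that rule is only a stand-in, and the LLM call itself is left out.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HierarchicalTransitionGraph:
    """Hypothetical two-level memory; not the paper's actual data structure."""

    # Coarse level: global direction -> detailed local strategies.
    local_strategies: dict = field(default_factory=lambda: defaultdict(list))
    # Transition experience: (previous direction, next direction) -> observed gains.
    transitions: dict = field(default_factory=lambda: defaultdict(list))

    def add_strategy(self, direction, strategy):
        self.local_strategies[direction].append(strategy)

    def record_transition(self, prev_direction, next_direction, gain):
        self.transitions[(prev_direction, next_direction)].append(gain)

    def select_direction(self, history):
        """Pick the direction with the best mean gain after the last step.

        Assumed scoring rule for illustration only. Raises ValueError if no
        directions have been registered.
        """
        prev = history[-1] if history else None

        def score(direction):
            gains = self.transitions.get((prev, direction), [])
            return sum(gains) / len(gains) if gains else 0.0

        # Sorting makes tie-breaking deterministic (alphabetical order).
        return max(sorted(self.local_strategies), key=score)

    def retrieve(self, direction):
        """Return a copy of the local strategies stored under a direction."""
        return list(self.local_strategies.get(direction, []))


def build_prompt(kernel_source, direction, strategies):
    """Assemble a text prompt; the LLM call is deliberately omitted."""
    hints = "\n".join(f"- {s}" for s in strategies)
    return (
        f"Optimize this CUDA kernel. Focus: {direction}\n"
        f"Candidate strategies:\n{hints}\n\n{kernel_source}"
    )
```

In use, one would call `select_direction` with the recent optimization history, pass the result to `retrieve`, and feed both into `build_prompt` before querying a model. After measuring the generated kernel, `record_transition` would store the observed gain so later steps can draw on it.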