Tsinghua AI | ByteDance | Tony | USTC | Feb 27, 2026 | arXiv:2602.24286

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou

AI Summary

The paper introduces CUDA Agent, a large-scale agentic reinforcement learning system that trains an LLM to write and optimize CUDA kernels. It combines a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling, and RL algorithmic techniques for stable training. On KernelBench, CUDA Agent outperforms torch.compile and proprietary models such as Claude Opus 4.5 and Gemini 3 Pro.

Key Contribution

Agentic RL can now beat proprietary LLMs and torch.compile at CUDA kernel generation: on KernelBench, CUDA Agent outperforms the strongest proprietary models by about 40% on the hardest Level-3 split.

Abstract

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering 100%, 100%, and 92% faster rates over torch.compile on the KernelBench Level-1, Level-2, and Level-3 splits, and outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
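The abstract does not define "faster rate". The following is a minimal sketch, assuming it means the fraction of benchmark tasks on which the generated kernel both passes a correctness check and runs faster than the torch.compile baseline. The `TaskResult` type, the `faster_rate` function, and the timings are hypothetical and for illustration only; they are not taken from the paper or from KernelBench.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure)."""
    correct: bool        # generated kernel matched the reference output
    kernel_ms: float     # measured runtime of the generated kernel
    baseline_ms: float   # measured runtime of the baseline (e.g. torch.compile)


def faster_rate(results: Sequence[TaskResult]) -> float:
    """Fraction of tasks where the kernel is correct AND faster than the baseline.

    Assumed definition: an incorrect kernel never counts as a win,
    no matter how fast it runs.
    """
    if not results:
        raise ValueError("faster_rate needs at least one task result")
    wins = sum(1 for r in results if r.correct and r.kernel_ms < r.baseline_ms)
    return wins / len(results)


# Made-up timings for four tasks: two wins, one slower kernel, one incorrect kernel.
example = [
    TaskResult(correct=True, kernel_ms=1.0, baseline_ms=2.0),
    TaskResult(correct=True, kernel_ms=0.5, baseline_ms=0.8),
    TaskResult(correct=True, kernel_ms=3.0, baseline_ms=2.5),
    TaskResult(correct=False, kernel_ms=0.1, baseline_ms=2.0),
]
rate = faster_rate(example)
```

Under this assumed definition, a reported faster rate of 92% on Level-3 would mean that the generated kernel was correct and beat the baseline on 92% of the Level-3 tasks.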

Code Generation & Program Synthesis | Distributed Systems & Hardware | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
