Jan 1, 2026 | arXiv:2601.00227

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye, Charlie F. Ruan, Yingyi Huang, Yineng Zhang, Liangsheng Yin, Aksara Bayyapu, Luis Ceze, Tianqi Chen

AI Summary

The paper introduces FlashInfer-Bench, a closed-loop framework for developing and deploying AI-generated GPU kernels in LLM inference systems. It provides a standardized schema (FlashInfer Trace) for kernel definition, benchmarking, and deployment, facilitating communication between LLM agents and inference systems. The framework includes a curated dataset, a benchmarking system, a leaderboard, and a dynamic substitution mechanism for integrating kernels into engines like SGLang and vLLM, enabling continuous improvement of AI-generated kernels.

Key Contribution

A standardized framework for benchmarking AI-generated GPU kernels and integrating them into production LLM engines, closing the loop between kernel generation, evaluation, and deployment.

Abstract

Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, FlashInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents' GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using FlashInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. FlashInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them into large-scale LLM inference.
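
The substitution idea in the abstract (pick the best-performing kernel that also passes correctness checks, otherwise fall back to a reference) can be illustrated with a toy sketch. Everything below is hypothetical: the names `Candidate`, `select_best`, `dispatch`, and `REGISTRY`, the record fields, and the latency values are placeholders made up for illustration. This is not the FlashInfer-Bench API or the FlashInfer Trace schema, and plain Python functions stand in for GPU kernels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Kernel = Callable[[List[float]], List[float]]


@dataclass
class Candidate:
    """Hypothetical evaluation record for one implementation of a kernel definition."""
    name: str
    fn: Kernel
    passed_correctness: bool  # assumed outcome of a correctness check
    latency_ms: float         # placeholder number, not a real measurement


def reference_scale(xs: List[float]) -> List[float]:
    """Reference implementation used when no candidate qualifies."""
    return [2.0 * x for x in xs]


def select_best(candidates: List[Candidate], fallback: Kernel) -> Kernel:
    """Return the fastest candidate that passed correctness, else the fallback."""
    valid = [c for c in candidates if c.passed_correctness]
    if not valid:
        return fallback
    return min(valid, key=lambda c: c.latency_ms).fn


# Hypothetical registry keyed by kernel definition name.
REGISTRY: Dict[str, List[Candidate]] = {
    "scale_by_two": [
        Candidate("agent_a", lambda xs: [x + x for x in xs], True, 0.8),
        # Fastest on paper, but wrong, so it must never be selected.
        Candidate("agent_b", lambda xs: [3.0 * x for x in xs], False, 0.1),
        Candidate("agent_c", lambda xs: [2.0 * x for x in xs], True, 1.5),
    ]
}


def dispatch(definition: str, xs: List[float]) -> List[float]:
    """Route a call to the selected kernel for the given definition."""
    kernel = select_best(REGISTRY.get(definition, []), reference_scale)
    return kernel(xs)
```

In this sketch, filtering on correctness before ranking by latency means a fast but incorrect candidate is never chosen, and a definition with no registered candidates falls back to the reference implementation.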

Citation Metrics

Citations: 2

Influential citations: 0

References: 23

Year: 2026

Venue: N/A
