NUS · CUHK · DualverseAI · PKU · Shenzhen Loop Area Institute · SJTU · Tencent AI · Jun 16, 2026 · arXiv:2606.17861

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang, Jianlong Chen, Juhao Liang, Ke Ji, Shuqi Guo, Yuhao Du, Fan Bu, Wenyu Du, Xiaotong Zhang, Shaobo Wang, Linfeng Zhang, Yuxuan Liu, Xin Lai, Chenxin Li, Yiduo Guo, Zhexin Zhang, Xinyuan Wang, Tianyi Bai, Ziniu Li, Benyou Wang

AI Summary

This paper introduces GameCraft-Bench, a benchmark that evaluates whether coding agents can generate playable games end-to-end inside a real game engine, Godot. The authors propose an evaluation framework built on three desiderata (Engine Grounding, Artifact Completeness, and Interactive Verification) that assesses generated games through replayed demonstrations and rubric-guided multimodal judging. Even the best-performing agent scores only 41.46%, indicating that producing complete, functional, and engaging games remains difficult.

Key Contribution

Coding agents struggle to create complete and engaging games: the strongest agent scores only 41.46% on end-to-end game generation, and most agents score below 40%.

Abstract

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.
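As a rough illustration of how rubric-guided judging could yield a single percentage such as the one reported above, the following Python sketch aggregates per-criterion verdicts into a task score and averages task scores into a benchmark score. Everything here is an assumption made for illustration: the `RubricItem` type, the weights, the [0, 1] verdict scale, and the unweighted mean across tasks are hypothetical and are not taken from the paper, which may define its scoring differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricItem:
    """One hypothetical rubric criterion judged from a replayed demonstration."""
    name: str
    weight: float  # relative importance (assumed, not from the paper)
    score: float   # judge verdict in [0, 1] (assumed scale)


def task_score(items: list[RubricItem]) -> float:
    """Weighted mean of rubric verdicts for a single task, in [0, 1]."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    return sum(item.weight * item.score for item in items) / total_weight


def benchmark_score(tasks: dict[str, list[RubricItem]]) -> float:
    """Unweighted mean of task scores, reported as a percentage."""
    if not tasks:
        raise ValueError("no tasks to score")
    return 100.0 * mean(task_score(items) for items in tasks.values())


# Hypothetical verdicts for two made-up tasks.
example_tasks = {
    "platformer": [
        RubricItem("core mechanic works", weight=2.0, score=1.0),
        RubricItem("visual feedback", weight=1.0, score=0.5),
        RubricItem("content sufficiency", weight=1.0, score=0.0),
    ],
    "puzzle": [
        RubricItem("core mechanic works", weight=1.0, score=0.0),
    ],
}

if __name__ == "__main__":
    print(f"{benchmark_score(example_tasks):.2f}%")
```

Under this assumed scheme, a game whose mechanics work but whose content and visual feedback are lacking earns partial credit, which is consistent in spirit with the abstract's observation that agents implement recognizable mechanics yet fall short of complete games.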

Code Generation & Program Synthesis · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
