Fudan, Tongji, Xiaohongshu, University of California. May 27, 2026. arXiv:2605.28258

GUI Agents for Continual Game Generation

Yixu Huang, Bo Li, Na Li, Zhe Wang, Kaijie Chen, Haonan Ge, Qingyi Si, Yuanzhe Shen, Ruihan Yang, Guangjing Wang, Hongcheng Guo

AI Summary

This paper introduces a framework for continual game generation using GUI agents, addressing the limitations of one-shot generation approaches that fail to account for interaction-level failures. The authors propose PlaytestArena, an evaluation environment in which a GUI agent plays each generated game and checks it against predefined rubrics, and Play2Code, a closed-loop system where a game agent and a GUI agent iteratively refine the game through dialogue. Experiments show that Play2Code improves game playability, achieving a 66.8% rubric pass rate, which highlights the value of interactive code generation informed by GUI playtesting.

Key Contribution

Frontier models struggle to build playable games in one shot, while a closed-loop system that uses GUI agents to playtest and refine code reaches a 66.8% rubric pass rate, suggesting that game generation works better as a dialogue between coding and playing than as a one-shot translation.

Abstract

Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at https://continual-game-generation.vercel.app/.
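The code-then-play loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `SharedMemory`, `playtest`, `refine`, `pass_rate`, the toy coder, and the toy rubric are all hypothetical names introduced here, and the real system uses an LLM game agent and a GUI agent that plays the build in a browser rather than string checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SharedMemory:
    """Notes visible to both agents across rounds (hypothetical structure)."""
    notes: List[str] = field(default_factory=list)


# Assumption: a rubric maps a behavior name to a check on the build.
Rubric = Dict[str, Callable[[str], bool]]
Coder = Callable[[str, SharedMemory], str]


def playtest(build: str, rubric: Rubric) -> Dict[str, bool]:
    """Stand-in for the GUI agent: evaluate every rubric item on the build."""
    return {name: check(build) for name, check in rubric.items()}


def pass_rate(results: Dict[str, bool]) -> float:
    """Percentage of rubric items that passed."""
    return 100.0 * sum(results.values()) / len(results)


def refine(prompt: str, coder: Coder, rubric: Rubric,
           max_rounds: int = 3) -> Tuple[str, Dict[str, bool]]:
    """Alternate coding and playtesting until the rubric passes or rounds run out."""
    memory = SharedMemory()
    build = coder(prompt, memory)
    results = playtest(build, rubric)
    rounds = 0
    while not all(results.values()) and rounds < max_rounds:
        # Record the failed behaviors so the next coding pass can address them.
        memory.notes.extend(name for name, ok in results.items() if not ok)
        build = coder(prompt, memory)
        results = playtest(build, rubric)
        rounds += 1
    return build, results


def toy_coder(prompt: str, memory: SharedMemory) -> str:
    """Toy game agent: implements only what the playtester has reported so far."""
    features = {"renders"} | set(memory.notes)
    return " ".join(sorted(features))


# Toy rubric: each behavior "passes" if its name appears in the build string.
TOY_RUBRIC: Rubric = {
    "renders": lambda build: "renders" in build,
    "player_moves": lambda build: "player_moves" in build,
    "score_updates": lambda build: "score_updates" in build,
}
```

In this toy setup, a single pass of `toy_coder` satisfies one of three rubric items, while `refine("make a platformer", toy_coder, TOY_RUBRIC)` satisfies all three after one round of feedback. The gap only illustrates why playtest feedback can matter; it says nothing about the magnitudes reported in the paper.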

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
