This paper introduces a framework for continual game generation using GUI agents, addressing a limitation of one-shot generation approaches: they leave interaction-level failures undetected. The authors propose PlaytestArena, an evaluation environment in which a GUI agent plays each generated game and assesses its playability against predefined rubrics, and Play2Code, a closed-loop system in which a game agent and a GUI agent iteratively refine the game through dialogue. In the reported experiments, Play2Code achieves a 66.8% rubric pass rate, 37.1 points above a single-pass baseline and 14.6 points above an agentic-coding baseline, highlighting the value of interactive code generation informed by GUI playtesting.
Frontier models struggle to build playable games in one shot, but a closed-loop system in which a GUI agent playtests each build and a game agent refines the code reaches a 66.8% rubric pass rate, suggesting that game generation works better as a conversation than as a translation.
Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at https://continual-game-generation.vercel.app/.
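The loop described in the abstract can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: `SharedMemory`, `refine_loop`, `rubric_pass_rate`, the two stub agents, and the `max_rounds` cutoff are all hypothetical names and choices introduced here. The abstract specifies only that a game agent and a GUI agent operate in a sustained loop with shared memory and that results are reported as a rubric pass rate.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SharedMemory:
    """Hypothetical log that both agents can read and append to."""
    notes: List[str] = field(default_factory=list)


def rubric_pass_rate(results: List[bool]) -> float:
    """Percentage of rubric items judged as passed (0.0 for an empty rubric)."""
    return 100.0 * sum(results) / len(results) if results else 0.0


def refine_loop(
    prompt: str,
    write_game: Callable[[str, SharedMemory], str],
    playtest: Callable[[str, SharedMemory], List[str]],
    max_rounds: int = 3,  # assumed stopping rule, not taken from the paper
) -> Tuple[str, SharedMemory]:
    """Alternate between coding and playing until no issues are reported."""
    memory = SharedMemory()
    build = write_game(prompt, memory)
    for _ in range(max_rounds):
        issues = playtest(build, memory)
        if not issues:
            break
        memory.notes.extend(issues)         # playtester feedback goes to memory
        build = write_game(prompt, memory)  # the game agent revises the build
    return build, memory


def stub_write_game(prompt: str, memory: SharedMemory) -> str:
    # Stand-in for the game agent: the "build" only records a revision count.
    return f"{prompt} (revision {len(memory.notes)})"


def stub_playtest(build: str, memory: SharedMemory) -> List[str]:
    # Stand-in for the GUI agent: reports one issue until two are logged.
    if len(memory.notes) < 2:
        return [f"issue {len(memory.notes) + 1}"]
    return []


final_build, final_memory = refine_loop("pong", stub_write_game, stub_playtest)

# Baseline pass rates implied by the abstract's figures (simple subtraction).
single_pass_baseline = round(66.8 - 37.1, 1)
agentic_baseline = round(66.8 - 14.6, 1)
```

With these stubs the loop stops once the playtester reports nothing new. Taken at face value, the abstract's numbers imply baseline pass rates of roughly 29.7% (single-pass) and 52.2% (agentic coding).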