Tsinghua AI, CUHK, HKU, LIGHTSPEED, University of California Los Angeles
Jun 8, 2026
arXiv:2606.09826

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang, Wei Huang, Yitang Li, Fan Zhang, Zeyu Hu, Lingting Zhu, Xin Wang, Xiaojuan Qi

AI Summary

This paper introduces OmniGameArena, a benchmark for evaluating vision-language model (VLM) agents on twelve Unreal Engine 5 games spanning Solo, PvP, and Cooperative modes. Its Improvement Dynamics Curve (IDC) tracks agent performance over multiple rounds of reflection, so the evaluation captures not only the initial score but also how the learned skill evolves and how it behaves on held-out task variants. The reported results show that these dynamics differ substantially across agent types.

Key Contribution

VLM agents differ widely in how their skills evolve across reflection rounds, so first-attempt scores alone can mislead when improvement dynamics are ignored.

Abstract

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.
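The reflection loop described in the abstract can be pictured with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `run_idc`, `IDCResult`, `toy_play`, `toy_reflect`, and `toy_held_out`, the 200-character bound on the skill prompt, and the toy scoring rule. It only shows the general shape of the procedure: record a cold-start score, let a reflector revise a bounded skill prompt for a fixed number of rounds, log the score after each round, and finally score the learned skill on a held-out variant.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumed bound on the skill prompt length; the abstract gives no number.
MAX_SKILL_CHARS = 200


@dataclass
class IDCResult:
    # round_scores[0] is the cold-start score; later entries follow each reflection round.
    round_scores: List[float] = field(default_factory=list)
    final_skill: str = ""
    held_out_score: float = 0.0


def run_idc(
    play: Callable[[str], float],
    reflect: Callable[[str, List[float]], str],
    rounds: int,
    held_out_play: Callable[[str], float],
) -> IDCResult:
    """Hypothetical harness: play, reflect, truncate the skill, and repeat."""
    skill = ""  # cold start: the agent plays with no learned skill prompt
    result = IDCResult()
    result.round_scores.append(play(skill))
    for _ in range(rounds):
        proposal = reflect(skill, result.round_scores)
        skill = proposal[:MAX_SKILL_CHARS]  # keep the skill prompt bounded
        result.round_scores.append(play(skill))
    result.final_skill = skill
    result.held_out_score = held_out_play(skill)
    return result


# Toy stand-ins for a real game episode and a real reflector LLM.
def toy_play(skill: str) -> float:
    return float(skill.count("tip"))  # one point per tip in the skill prompt


def toy_reflect(skill: str, history: List[float]) -> str:
    return skill + "tip;"  # the "reflector" adds one tip per round


def toy_held_out(skill: str) -> float:
    return 0.5 * skill.count("tip")  # tips transfer only partially


if __name__ == "__main__":
    res = run_idc(toy_play, toy_reflect, rounds=3, held_out_play=toy_held_out)
    print(res.round_scores)    # [0.0, 1.0, 2.0, 3.0]
    print(res.held_out_score)  # 1.5
```

In this sketch, the list of per-round scores plays the role of the improvement curve, and the held-out score plays the role of the second observable. A real harness would replace the toy functions with game episodes and a tool-using reflector model.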

Eval Frameworks & Benchmarks, Multimodal Models

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
