Tsinghua AI | University of Electronic Science and Technology | May 24, 2026 | arXiv:2605.25160

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Guohong Liu, Jialei Ye, Pengzhi Gao, Wei Liu, Yunxin Liu, Yuanchun Li

AI Summary

SimuWoB is a new benchmark that addresses the limitations of existing mobile GUI agent benchmarks by providing a fully synthetic environment with 120 diverse and challenging tasks that closely mimic real-world mobile applications. The framework synthesizes high-fidelity tasks and environments, automatically provides valid rewards, and deploys each environment as a backend-free webpage for efficient and reproducible evaluation. Experiments with state-of-the-art agents reveal significant weaknesses: the average success rate is only 27.92% overall and falls to 17.82% on long-horizon tasks, highlighting the need for improved agent capabilities in complex scenarios.

Key Contribution

Current mobile GUI agents struggle with complex, long-horizon tasks in realistic simulated environments: on SimuWoB they achieve an average success rate of only 27.92%, dropping to 17.82% on long-horizon tasks.

Abstract

Mobile GUI agents powered by large language models have progressed rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are often limited to open-source apps or file-operation tasks because of the difficulty of constructing rewards on real applications, leaving a gap between benchmark settings and real-world usage. Moreover, most benchmarks focus on basic grounding and navigation, with limited coverage of complex, long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthetic benchmark for mobile GUI agents with 120 challenging tasks spanning diverse types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments and automatically provides valid rewards for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals substantial weaknesses in current agents under complex scenarios. A comparison of evaluation results with those on sample real-world tasks demonstrates that agent assessments based on our synthetic environments generalize well. We further provide diagnostic insights across key capability dimensions and discuss implications for future mobile GUI agent development.
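The two headline figures above are success rates over different task subsets. As a minimal sketch of how such rates could be aggregated from per-episode binary rewards, the Python below groups outcomes by task category. The function name `success_rates`, the category labels, and the episode data are all hypothetical illustrations, not the benchmark's actual code or results.

```python
from collections import defaultdict


def success_rates(results):
    """Return success rates (in percent) per category and overall.

    `results` is an iterable of (category, reward) pairs, where reward is
    1 for a successful episode and 0 otherwise (an assumed reward format).
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for category, reward in results:
        totals[category] += 1
        wins[category] += reward
    rates = {c: 100.0 * wins[c] / totals[c] for c in totals}
    rates["overall"] = 100.0 * sum(wins.values()) / sum(totals.values())
    return rates


# Hypothetical episode outcomes, for illustration only.
episodes = [
    ("short", 1), ("short", 0), ("short", 1), ("short", 0),
    ("long", 0), ("long", 0), ("long", 1), ("long", 0),
]
rates = success_rates(episodes)
print(rates)
```

Reporting the long-horizon subset separately from the overall average, as the abstract does, keeps a weakness on harder tasks from being masked by easier ones.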

Eval Frameworks & Benchmarks | Tool Use & Agents | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
