Mar 16, 2026arXiv:2603.15039

GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, Ying Huang

AI Summary

GUI-CEval, a new benchmark, is introduced to evaluate Chinese mobile GUI agents, addressing the limitations of existing English-centric benchmarks that fail to capture the nuances of the Chinese mobile ecosystem. The benchmark encompasses 201 mainstream apps across four device types and uses a two-level structure to evaluate both atomic abilities and application-level performance across five dimensions. Experiments on 20 MLLMs reveal weaknesses in reflective decision-making and self-evaluation, highlighting areas for improvement in real-world interactions.

Key Contribution

Most MLLMs still struggle with reflective decision-making and self-evaluation in Chinese mobile GUI environments, hindering their reliability in real-world interactions.

Abstract

Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

Related Papers