HyperAI Team, Xiaomi Corporation
{liyang134,xujiaming1}@xiaomi.com
These authors contributed equally. Corresponding author.

Abstract

Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent execution, lacking a unified, fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.

Figure 1: Results of six representative multimodal large language models across seven tasks defined in GUI-CEval. The uneven radar profiles show that GUI-CEval offers a comprehensive examination of perception-to-execution capabilities and poses a substantial challenge to current models.

1 Introduction

Figure 2: A representative example illustrating how GUI-CEval provides a comprehensive analysis of a Chinese mobile instruction. It evaluates realistic GUI application ability using tasks constructed from real mobile environments and trajectories, while also assessing atomic skills via single-answer multiple-choice questions, enabling developers to diagnose and improve model weaknesses.

The rapid advancement of multimodal large language models (MLLMs) [1, 46, 37, 41] has empowered GUI agents with the ability to perceive, reason, and act within real graphical interfaces [28, 45], enabling intelligent interaction and automation across mobile environments [16, 32, 42]. Although existing benchmarks such as Screenspot [5], Screenspot Pro [11], AndroidControl [12], and AndroidWorld [26] have advanced the capabilities of GUI agents, several limitations persist: (i) language bias: most are English-centric [14, 6, 17], limiting evaluation in Chinese ecosystems [44]; (ii) scene inconsistency: data are collected from diverse platforms [5, 35], lacking a focused assessment of mobile environments; (iii) task narrowness: current benchmarks emphasize UI element localization [5, 11] or offline agent success rates [12, 2], offering limited insight into comprehensive, full-pipeline capabilities; and (iv) limited data realism: automated collection and validation [10, 29] overlook real user intents, reducing practical reliability.

To establish a fair, comparable, and diagnostic evaluation standard for Chinese mobile environments, we propose GUI-CEval, the first comprehensive benchmark for mobile GUI agents tailored to the Chinese ecosystem, as shown in Fig. 2. GUI-CEval spans 201 mainstream Chinese applications across four real mobile device types, comprising 4,028 agent tasks and 4,194 multimodal question-answering (QA) tasks. The benchmark adopts a hierarchical design that integrates both fundamental and applied capabilities, defining five core dimensions aligned with the complete workflow of a mobile GUI agent: perception, planning, reflection, execution, and evaluation. Fundamental capabilities are examined through diagnostic multimodal QA tasks that target atomic skills, enabling fine-grained capability analysis and guiding model improvement. Application tasks, in turn, cover three critical scenarios (GUI grounding, offline agent, and online agent) to assess end-to-end performance from target localization to action execution; a schematic view of this two-level task structure is sketched below.
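To make the two-level structure concrete, the following is a minimal illustrative sketch of how the two task families could be represented. The field names and the `score_qa` helper are schematic assumptions for exposition only and do not prescribe the released data format.

```python
from dataclasses import dataclass
from typing import List

# Illustrative schema for the two GUI-CEval task families.
# Field names are assumptions, not the benchmark's actual format.

@dataclass
class QATask:
    """Atomic-skill diagnostic item: single-answer multiple choice."""
    dimension: str        # perception | planning | reflection | execution | evaluation
    screenshot: str       # path to a real-device screenshot
    question: str         # Chinese question about the screen or the next step
    choices: List[str]    # candidate answers; exactly one is correct
    answer: int           # index of the correct choice

@dataclass
class AgentTask:
    """Application-level task: grounding, offline agent, or online agent."""
    scenario: str         # "grounding" | "offline" | "online"
    app: str              # one of the 201 mainstream Chinese apps
    instruction: str      # natural-language user goal in Chinese
    trajectory: List[dict]  # human-demonstrated (screenshot, action) steps

def score_qa(task: QATask, predicted_choice: int) -> bool:
    """QA items are scored by exact match on the selected option."""
    return predicted_choice == task.answer
```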
All data are human-curated on real mobile devices and verified, ensuring realistic interaction contexts and coherent task flows [9]. This design substantially improves the credibility, realism, and diagnostic reliability of GUI agent evaluation. The datasets and evaluation code will be released to advance the development of Chinese mobile GUI agents.

We evaluate 20 representative multimodal models, GUI-specific models, and multi-agent systems on GUI-CEval. The results in Fig. 1 reveal that current models are not yet ready for stable deployment in real Chinese mobile environments.

In summary, our contributions are as follows:

1. Comprehensive Chinese mobile benchmark. We present GUI-CEval, the first large-scale benchmark for Chinese mobile GUI agents, covering 201 mainstream apps and 4 real device types with 4,194 multimodal QA and 4,028 agent tasks for comprehensive and fine-grained evaluation.

2. Hierarchical five-dimensional diagnostic framework. GUI-CEval introduces a hierarchical structure spanning perception, planning, reflection, execution, and evaluation. It unifies GUI grounding, offline, and online agent scenarios to enable fine-grained, end-to-end capability diagnosis.

3. Human-verified real-world data pipeline. All data are collected and annotated through real device demonstrations and human review, ensuring realistic interaction contexts and preventing data leakage or template bias.

4. Extensive evaluation and insights. Experiments on 20 representative models reveal persistent weaknesses in generalization, stability, and reflective reasoning, highlighting GUI-CEval's value as a diagnostic and developmental foundation for Chinese mobile GUI agents.

2 Related Work

GUI Agent. Current approaches generally fall into two main paradigms: (i) the workflow paradigm [34, 33], which uses carefully designed prompts containing task descriptions, UI states, action histories, and reflective reasoning, often combined with grounding [15] or OCR [23] models to form multi-component execution pipelines (see the sketch below); and (ii) the end-to-end paradigm [5, 39, 20, 21, 31], which trains MLLMs through supervised or reinforcement learning to unify perception, reasoning, and action generation within a single model, offering faster execution and lower resource consumption. Despite these advances, mobile GUI agents still rely heavily on robust error-recovery mechanisms and fall far short of human performance in real-world scenarios.
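As a concrete illustration of the workflow paradigm, the snippet below assembles a single agent-step prompt from the components named above. The template wording and the `build_step_prompt` helper are hypothetical and not drawn from any specific system cited here.

```python
from typing import List

def build_step_prompt(task: str, ui_state: str, history: List[str]) -> str:
    """Assemble a workflow-paradigm prompt for one agent step.

    Combines the task description, the current UI state (e.g., an
    accessibility tree or OCR-extracted text), the action history, and
    a reflection instruction, as in multi-component pipelines.
    """
    history_text = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(history)) or "(none)"
    return (
        f"Task: {task}\n"
        f"Current UI state:\n{ui_state}\n"
        f"Actions taken so far:\n{history_text}\n"
        "First, reflect on whether the previous action achieved its intended effect.\n"
        "Then output the next action as JSON, e.g., "
        '{"action": "tap", "target": "<element>"} or {"action": "type", "text": "<text>"}.'
    )
```

In such pipelines, the returned prompt is sent to an MLLM at every step, and the model's JSON action is parsed and executed before the loop repeats with the updated UI state.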
GUI Benchmarks. Existing GUI benchmarks mainly fall into three categories: (i) Grounding benchmarks, such as Screenspot [5], Screenspot Pro [11], UI-E