GUI-CEval is a new benchmark for evaluating Chinese mobile GUI agents. It addresses a limitation of existing English-centric benchmarks, which fail to capture the nuances of the Chinese mobile ecosystem. The benchmark covers 201 mainstream apps across four device types and uses a two-level structure to evaluate both atomic abilities and application-level performance across five dimensions. Experiments on 20 MLLMs reveal weaknesses in reflective decision-making and self-evaluation, which limit reliability in real-world interactions.
Most MLLMs still struggle with reflective decision-making and self-evaluation in Chinese mobile GUI environments, hindering their reliability in real-world interactions.
Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent evaluation, and lack a unified and fine-grained framework for assessing the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.
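To make the two-level, five-dimension structure concrete, the sketch below shows one way per-item results could be grouped by model, level, and dimension and then inspected for the weakest dimension. It is only an illustration: the `Result` record, the level labels `"atomic"` and `"application"`, the use of plain accuracy as the score, and the model name in the usage example are all assumptions, not the benchmark's actual data format or metric.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The five dimensions named in the abstract.
DIMENSIONS = ("perception", "planning", "reflection", "execution", "evaluation")
# Hypothetical labels for the two levels (atomic abilities vs. application-level tasks).
LEVELS = ("atomic", "application")


@dataclass(frozen=True)
class Result:
    """One scored benchmark item (hypothetical schema, for illustration only)."""
    model: str
    level: str
    dimension: str
    correct: bool


def summarize(results):
    """Return mean accuracy keyed by (model, level, dimension).

    Plain accuracy is an assumed metric; the benchmark may score items differently.
    """
    buckets = defaultdict(list)
    for r in results:
        if r.level not in LEVELS or r.dimension not in DIMENSIONS:
            raise ValueError(f"unknown level or dimension: {r.level}/{r.dimension}")
        buckets[(r.model, r.level, r.dimension)].append(1.0 if r.correct else 0.0)
    return {key: mean(values) for key, values in buckets.items()}


def weakest_dimension(summary, model, level):
    """Return the lowest-scoring dimension for one model at one level."""
    scores = {d: s for (m, lv, d), s in summary.items() if m == model and lv == level}
    if not scores:
        raise KeyError(f"no results for {model} at level {level}")
    return min(scores, key=scores.get)


if __name__ == "__main__":
    # Made-up toy records; these are not results from the paper.
    demo = [
        Result("example-model", "atomic", "perception", True),
        Result("example-model", "atomic", "perception", True),
        Result("example-model", "atomic", "reflection", True),
        Result("example-model", "atomic", "reflection", False),
    ]
    summary = summarize(demo)
    print(weakest_dimension(summary, "example-model", "atomic"))
```

Keeping the level and dimension in the grouping key, rather than collapsing everything into a single score, is what allows a per-dimension diagnosis of the kind the abstract describes.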