StepFun | May 26, 2026 | arXiv:2605.27761

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

Yifan Sui, Xinmiao Huang, Hongbing Li, Jiahe Lv, Haolong Yan, Yeqing Shen, Litao Liu, Zhimin Fan, Ziyang Meng, Jia Wang, J. Qi, Kaijun Tan, Zheng Ge, Daxin Jiang, Osamu Yoshie

AI Summary

AndroidDaily is a benchmark of 350 daily-use tasks across 94 real-world, closed-source Android applications for evaluating mobile GUI agents. Because closed-source applications do not expose internal states, the authors propose the GRADE evaluator, which judges agent behavior against observable external guidelines. Experiments show that GRADE agrees with human evaluators in 87.37% of cases and reveal a substantial performance gap on realistic mobile workflows: the best agent achieves only a 62.0% success rate.

Key Contribution

Even the strongest current mobile GUI agent completes only 62.0% of everyday smartphone tasks on AndroidDaily, a new benchmark of real-world, closed-source Android apps.

Abstract

The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable. To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities. To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints. GRADE tracks the agent's visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37% agreement with human evaluators. The strongest model reaches a 62.0% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.
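
To make the three guideline tiers concrete, here is a minimal Python sketch of how step-level judgments might be rolled up into a task verdict, and how agreement with human labels could be computed. It is an illustration only, not the authors' implementation. The names `Tier`, `Guideline`, `StepJudgment`, `task_success`, and `agreement_rate` are hypothetical. The aggregation rule is also an assumption: every obligation and quality guideline must be observed at some step, and no negative constraint may ever be observed. The abstract does not say how GRADE combines its judgments.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # The three tiers named in the abstract.
    OPERATIONAL_OBLIGATION = "operational obligation"
    OUTPUT_QUALITY = "output quality"
    NEGATIVE_CONSTRAINT = "negative constraint"


@dataclass(frozen=True)
class Guideline:
    tier: Tier
    description: str


@dataclass(frozen=True)
class StepJudgment:
    step: int             # index of the step in the agent's trajectory
    guideline: Guideline  # guideline checked at this step
    observed: bool        # True if the guideline's condition was seen on screen


def task_success(guidelines, judgments):
    """Assumed rule: all positive guidelines observed, no negative one observed."""
    observed = {j.guideline for j in judgments if j.observed}
    for g in guidelines:
        if g.tier is Tier.NEGATIVE_CONSTRAINT:
            if g in observed:      # a forbidden behavior occurred
                return False
        elif g not in observed:    # a required behavior never occurred
            return False
    return True


def agreement_rate(auto_verdicts, human_verdicts):
    """Fraction of tasks where the automatic and human verdicts match."""
    if len(auto_verdicts) != len(human_verdicts) or not auto_verdicts:
        raise ValueError("verdict lists must be non-empty and of equal length")
    matches = sum(a == h for a, h in zip(auto_verdicts, human_verdicts))
    return matches / len(auto_verdicts)


# Hypothetical task: the guideline texts below are invented for illustration.
opened = Guideline(Tier.OPERATIONAL_OBLIGATION, "open the order page")
correct = Guideline(Tier.OUTPUT_QUALITY, "selected item matches the request")
no_pay = Guideline(Tier.NEGATIVE_CONSTRAINT, "do not confirm payment")
GUIDELINES = [opened, correct, no_pay]

CLEAN_RUN = [
    StepJudgment(0, opened, True),
    StepJudgment(1, correct, True),
    StepJudgment(2, no_pay, False),
]
VIOLATING_RUN = CLEAN_RUN + [StepJudgment(3, no_pay, True)]
```

Under this assumed rule, `task_success(GUIDELINES, CLEAN_RUN)` is `True`, while `VIOLATING_RUN` fails because the negative constraint is observed at step 3. The step index on each judgment is what would let an evaluator report where a trajectory went wrong, in the spirit of the step-level diagnostics described above.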

Topics: Eval Frameworks & Benchmarks, Robotics & Embodied AI, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 58

Year: 2026

Venue: N/A
