May 26, 2026

arXiv:2605.27134

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Heng Qu, Yike Liu, Renren Jin, Wenzong Zhang, Pengzhi Gao, Wei Liu

AI Summary

This paper introduces HyperTrack, a large-scale dataset of over 16,000 real-world mobile GUI navigation tasks across more than 650 Chinese mobile apps, and GUIEvalKit, an open-source toolkit for unified benchmarking of vision-language models (VLMs) on offline GUI navigation tasks. Using HyperTrack, the authors show that reinforcement-based fine-tuning consistently outperforms supervised fine-tuning, particularly in out-of-domain settings. Using GUIEvalKit, they benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion.

Key Contribution

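Reinforcement-based fine-tuning on HyperTrack, a new large-scale mobile GUI navigation dataset, consistently outperforms supervised fine-tuning, particularly in out-of-domain settings, which highlights the synergy between data scaling and reinforcement learning for vision-language agents.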

Abstract

Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16,000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.

Eval Frameworks & Benchmarks, Multimodal Models, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
