The paper introduces AndroidWMSearch, a tree search framework for mobile agents on Android that uses a learned world model to simulate the environment and evaluate actions before execution, addressing the challenge of irreversible operations on real devices. The authors train specialized LLMs as world models using a scalable data synthesis pipeline. Experiments on the AndroidWorld benchmark show that AndroidWMSearch outperforms the T3A agent by 4.7% and achieves a 3.0% performance gain over GPT-4o when using a dedicated Android-trained world model.
Training a specialized LLM as a world model for Android environments yields a 3.0% performance gain over GPT-4o in mobile agent tree search.
Mobile agents powered by large language models (LLMs) have demonstrated remarkable potential in automating operations on mobile devices. Recent studies have shown that incorporating tree search methods and increasing test-time computation can enhance an agent's multi-step reasoning and planning capabilities. However, unlike simulated sandbox environments, Android is a dynamic environment with many irreversible operations, making tree search backtracking less feasible on the Android platform. To address this challenge, we propose AndroidWMSearch, a novel agent tree search framework that leverages a world model to emulate the Android environment. This framework allows the agent to evaluate and rank candidate actions through simulation before actual execution. We systematically explore this paradigm by: (1) proposing a model-based Android tree search framework, AndroidWMSearch, in which LLMs are utilized both as world models and as value functions; and (2) training specialized LLMs to act as world models, using a scalable data synthesis pipeline. On the AndroidWorld benchmark, AndroidWMSearch surpasses the T3A agent by 4.7%, underscoring the effectiveness of our proposed framework. Moreover, using our AndroidWM-7B, which is specifically trained for Android environments, as the world model yields a 3.0% performance gain compared to employing GPT-4o. These findings highlight the importance and efficacy of training a dedicated world model tailored for mobile agents.
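The core idea of the abstract, simulating candidate actions with a world model and ranking them by a value function instead of executing them on a real device, can be illustrated with a minimal sketch. All names and interfaces here (`rank_actions`, `world_model`, `value_fn`, the toy stand-ins) are assumptions for illustration, not the paper's actual API; in AndroidWMSearch both roles would be filled by LLM calls.

```python
# Hypothetical sketch: simulate each candidate action with a world model,
# score the predicted next state with a value function, and rank candidates
# best-first. No action is executed on the device, which is what makes the
# approach safe in the presence of irreversible operations.
from typing import Callable, List, Tuple

def rank_actions(
    state: str,
    candidates: List[str],
    world_model: Callable[[str, str], str],  # (state, action) -> predicted next state
    value_fn: Callable[[str], float],        # predicted state -> scalar value
) -> List[Tuple[str, float]]:
    """Return (action, score) pairs sorted from best to worst."""
    scored = []
    for action in candidates:
        predicted = world_model(state, action)  # simulation only, nothing executed
        scored.append((action, value_fn(predicted)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for the LLM world model and value function.
def toy_world_model(state: str, action: str) -> str:
    return f"{state} -> {action}"

def toy_value(predicted_state: str) -> float:
    return 1.0 if "tap_confirm" in predicted_state else 0.0

ranked = rank_actions(
    "settings_screen", ["tap_cancel", "tap_confirm"], toy_world_model, toy_value
)
best_action, best_score = ranked[0]
```

The agent would then execute only `best_action` in the real environment, so the tree search never needs to backtrack across an irreversible step.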