This paper introduces iOSWorld, the first interactive native iOS simulator benchmark for evaluating personally intelligent phone agents, built around a persistent user identity that spans 26 newly developed iOS apps. The benchmark includes 133 tasks that assess agents' ability to reason over interconnected personal data. The best configuration reaches a 52% success rate overall but only 37% on multi-app tasks. The findings highlight how hard it is for mobile agents to integrate personal data across apps, and show that privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from the added accessibility-tree input.
Personalization is key: even the best configuration succeeds on only 37% of multi-app tasks, against 52% overall.
A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.
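To make the reported numbers easier to relate to one another, the following minimal Python sketch checks that the three category counts sum to 133 and estimates what success rate the two headline figures would imply on the tasks that are not multi-app. It rests on an assumption the abstract does not confirm: that the overall score is a plain per-task average over all 133 tasks, with both figures coming from the same configuration. The names `TASK_COUNTS`, `category_shares`, and `implied_rate_elsewhere` are hypothetical and are not part of the released evaluation code.

```python
# Task counts per category, as stated in the abstract.
TASK_COUNTS = {
    "single_app": 27,
    "multi_app": 60,
    "memory_personalization": 46,
}


def category_shares(counts):
    """Return each category's fraction of the full task set."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def implied_rate_elsewhere(overall, multi_app, counts):
    """Success rate implied on all non-multi-app tasks combined.

    ASSUMPTION (not confirmed by the source): the overall score is an
    unweighted per-task average, so
    overall * total = multi_app * n_multi + rest * n_rest.
    """
    total = sum(counts.values())
    n_multi = counts["multi_app"]
    return (overall * total - multi_app * n_multi) / (total - n_multi)


total_tasks = sum(TASK_COUNTS.values())
shares = category_shares(TASK_COUNTS)
# 0.52 and 0.37 are the rounded figures reported for the best configuration.
rest_rate = implied_rate_elsewhere(0.52, 0.37, TASK_COUNTS)

print(f"total tasks: {total_tasks}")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of tasks")
print(f"implied rate on non-multi-app tasks: {rest_rate:.1%}")
```

Under that assumption, multi-app tasks make up about 45% of the benchmark, so a 37% rate there pulls the overall figure well below what the remaining 73 tasks would imply on their own. Because the reported percentages are rounded and the aggregation method is assumed, the derived value is only a rough indication.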