ZJU · Apr 15, 2026 · arXiv:2604.13822

UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization

Zhengxi Lu, Fei Tang, Guangyi Liu, Kaitao Song, Xu Tan, Jin Ma, Wenqi Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

AI Summary

UI-Copilot is a framework for long-horizon GUI automation that pairs a GUI agent with a lightweight copilot for memory retrieval and numerical computation. Memory decoupling separates persistent observations from transient execution context, and the agent learns to invoke the copilot as either a Retriever or a Calculator. Tool-Integrated Policy Optimization (TIPO) optimizes tool selection and task execution separately, yielding state-of-the-art performance on MemGUI-Bench and a 17.1% absolute improvement on AndroidWorld over the base Qwen model.
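To make memory decoupling and copilot dispatch concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption, not the paper's implementation. The `DecoupledMemory` class, the fixed-size transient window, the keyword-matching `retriever`, the AST-based `calculator`, and the `invoke_copilot` dispatcher are all hypothetical names. The summary does not describe the copilot at this level of detail, so simple rules stand in for it here.

```python
import ast
import operator
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class DecoupledMemory:
    """Hypothetical store split into persistent and transient parts (assumption)."""
    persistent: List[str] = field(default_factory=list)  # facts kept for the whole task
    transient: List[str] = field(default_factory=list)   # recent execution steps only
    window: int = 3                                       # assumed transient window size

    def record_observation(self, note: str) -> None:
        # Persistent observations are never truncated.
        self.persistent.append(note)

    def record_step(self, step: str) -> None:
        # Transient context keeps only the most recent `window` steps.
        self.transient.append(step)
        self.transient = self.transient[-self.window:]

def retriever(memory: DecoupledMemory, query: str) -> List[str]:
    # Naive keyword match over persistent notes; a hypothetical stand-in for the copilot.
    terms = query.lower().split()
    return [n for n in memory.persistent if any(t in n.lower() for t in terms)]

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def calculator(expression: str) -> float:
    # Evaluates +, -, *, / over numeric literals without calling eval().
    def ev(node: ast.AST) -> float:
        if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](ev(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return float(ev(ast.parse(expression, mode="eval").body))

def invoke_copilot(tool: str, arg: str, memory: DecoupledMemory) -> Union[List[str], float]:
    # The agent emits a tool name; the copilot dispatches to the matching role.
    if tool == "Retriever":
        return retriever(memory, arg)
    if tool == "Calculator":
        return calculator(arg)
    raise ValueError(f"unknown tool: {tool}")

mem = DecoupledMemory()
mem.record_observation("Price of item A: 12.50")
mem.record_observation("Price of item B: 7.25")
for i in range(5):
    mem.record_step(f"step {i}")
print(mem.transient)                                      # ['step 2', 'step 3', 'step 4']
print(invoke_copilot("Retriever", "price", mem))          # both price notes
print(invoke_copilot("Calculator", "12.50 + 7.25", mem))  # 19.75
```

In this sketch the transient context stays bounded while persistent notes remain retrievable on demand, and arithmetic is delegated to an exact evaluator rather than produced by the agent itself, which mirrors the memory-degradation and math-hallucination problems the paper targets.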

Key Contribution

Offloading memory and computation to a copilot lets a 7B-parameter GUI agent outperform strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B on long-horizon tasks, suggesting a path to more efficient and capable GUI automation.

Abstract

MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework in which the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as a Retriever or a Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on the challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot's strong generalization to real-world GUI tasks.
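The abstract describes TIPO only at the level of its two optimization paths: single-turn prediction for tool selection and on-policy multi-turn rollouts for task execution. The sketch below shows one possible way to score such a split. The binary rewards, the group-normalized (GRPO-style) advantages, and the function names `tool_selection_reward`, `execution_reward`, `group_advantages`, and `split_advantages` are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple

def tool_selection_reward(predicted: Optional[str], target: Optional[str]) -> float:
    # Single-turn signal (assumed): 1 if the agent picked the labeled tool,
    # including correctly choosing no tool (None), else 0.
    return 1.0 if predicted == target else 0.0

def execution_reward(task_succeeded: bool) -> float:
    # Multi-turn signal (assumed): binary outcome of a full on-policy rollout.
    return 1.0 if task_succeeded else 0.0

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # GRPO-style normalization within a group of samples (an assumption, not from the paper).
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def split_advantages(tool_preds: List[Optional[str]], tool_target: Optional[str],
                     rollout_outcomes: List[bool]) -> Tuple[List[float], List[float]]:
    # In this sketch, tool selection and task execution are scored and
    # normalized against their own groups, never mixed into one reward.
    tool_adv = group_advantages([tool_selection_reward(p, tool_target) for p in tool_preds])
    exec_adv = group_advantages([execution_reward(s) for s in rollout_outcomes])
    return tool_adv, exec_adv

tool_adv, exec_adv = split_advantages(
    ["Retriever", "Calculator", None, "Retriever"], "Retriever",
    [True, False, True, True],
)
print(tool_adv)
print(exec_adv)
```

Keeping the two signals separate lets a single-turn tool choice be credited or penalized directly, instead of only through the sparse outcome of a long episode; whether TIPO weights or combines them in this way is not stated in the abstract.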

Multimodal Models · Tool Use & Agents

Citation Metrics

Citations: 0
Influential citations: 0
References: 39
Year: 2026
Venue: N/A
