UC Santa Cruz | Apr 23, 2026 | arXiv:2604.21375

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Q. Han, Haoqin Tu, Zijun Wang, Haoyu Dai, Yiyang Zhou, Nancy Lau, Alvaro A. Cárdenas, Yuhui Xu, Ran Xu, Caiming Xiong, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, C. Xie

AI Summary

This paper introduces VLAA-GUI, a modular framework for autonomous GUI agents that targets two failure modes: early stopping and repetitive loops. The framework combines a Completeness Verifier that checks success criteria before the agent finishes, a Loop Breaker that intervenes after repeated failures, and an on-demand Search Agent that queries a search-capable LLM for unfamiliar workflows. Evaluated on five backbones, VLAA-GUI achieves top performance on both benchmarks (77.5% on OSWorld and 61.0% on WindowsAgentArena), and three of the five backbones exceed human performance (72.4%) on OSWorld in a single pass.

Key Contribution

VLAA-GUI lets autonomous agents verify completion claims against visual evidence and recover from repeated failures, with three of the five evaluated backbones surpassing human performance on OSWorld.

Abstract

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step, with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, Opus 4.6, and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
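
The multi-tier Loop Breaker described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, the method names, the returned intervention strings, and both thresholds are hypothetical, since the abstract does not specify how failures or screen-state recurrence are measured. The sketch assumes each step reports whether the action failed, a hash of the current screen, and an optional reflection signal.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical thresholds; the abstract does not state the actual values.
MODE_SWITCH_AFTER_FAILURES = 2
STRATEGY_CHANGE_AFTER_RECURRENCES = 3


@dataclass
class LoopBreaker:
    """Illustrative multi-tier loop detector (an assumption-laden sketch)."""

    consecutive_failures: int = 0
    screen_counts: Counter = field(default_factory=Counter)

    def _reset(self) -> None:
        # Clear all history once a strategy change has been forced.
        self.consecutive_failures = 0
        self.screen_counts.clear()

    def observe(
        self,
        action_failed: bool,
        screen_hash: str,
        reflection_flags_loop: bool = False,
    ) -> Optional[str]:
        """Record one agent step; return an intervention name or None."""
        if action_failed:
            self.consecutive_failures += 1
        else:
            self.consecutive_failures = 0
        self.screen_counts[screen_hash] += 1

        # Reflection signals are bound directly to a strategy shift.
        if reflection_flags_loop:
            self._reset()
            return "change_strategy"

        # Persistent recurrence of the same screen state forces a new strategy.
        if self.screen_counts[screen_hash] >= STRATEGY_CHANGE_AFTER_RECURRENCES:
            self._reset()
            return "change_strategy"

        # Repeated failures switch the interaction mode (e.g. mouse to keyboard).
        if self.consecutive_failures >= MODE_SWITCH_AFTER_FAILURES:
            self.consecutive_failures = 0
            return "switch_interaction_mode"

        return None
```

In an agent loop, `observe` would be called once per step and any non-`None` result injected into the next prompt as a constraint. The stronger interventions are checked first so that a step satisfying several conditions triggers the strategy change rather than the milder mode switch; that ordering is a design choice of this sketch, not something the abstract states.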

Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 76

Year: 2026

Venue: N/A
