Xidian, ZJU | Apr 8, 2026 | arXiv:2604.06995

What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

Songze Li, Xiaoke Guo, Tianqi Liu, Biao Yi, Zhaoyan Gong, Zhiqiang Liu, Huajun Chen

AI Summary

The paper introduces UI-in-the-Loop (UILoop), a paradigm for GUI reasoning that explicitly models the interaction between the screen, UI elements, and actions. UILoop trains MLLMs to localize key UI elements, understand their semantic functions, and learn their practical usage, enabling more precise element discovery and interpretable reasoning. Experiments on a new 26K-sample benchmark (UI Comprehension-Bench) show that UILoop achieves state-of-the-art UI understanding and superior GUI reasoning performance.

Key Contribution

Current GUI reasoning methods typically decide directly from the screen and overlook a comprehensive understanding of UI elements, which leads to task failures. UILoop addresses this by explicitly modeling UI element localization, semantic function, and usage, and reports state-of-the-art UI understanding performance.

Abstract

Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To improve understanding of, and interaction with, UIs, we propose a GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen → UI elements → Action process. By enabling Multimodal Large Language Models (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements, with three evaluation metrics. Correspondingly, we contribute a benchmark of 26K samples (UI Comprehension-Bench) to comprehensively evaluate existing methods' mastery of UI elements. Extensive experiments demonstrate that UILoop achieves state-of-the-art UI understanding performance while yielding superior results in GUI reasoning tasks.
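The cyclic process described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the names `UIElement`, `Action`, `find_key_element`, `act_on`, and `run_loop` are hypothetical, a "screen" is reduced to a plain list of elements, and a keyword match stands in for the MLLM's element discovery. It only illustrates how an explicit element record (location, function, usage) can sit between the screen and the action and make each step traceable.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class UIElement:
    """Hypothetical record: where an element is, what it does, how it is used."""
    name: str
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    function: str                    # semantic function, e.g. "opens settings"
    usage: str                       # interaction type, e.g. "click"


@dataclass(frozen=True)
class Action:
    kind: str
    point: Tuple[int, int]
    rationale: str  # human-readable reason, taken from the element record


# Assumption: a screen is just a list of elements. A real system would
# start from a screenshot and let the model localize the elements.
Screen = List[UIElement]


def find_key_element(screen: Screen, goal_keyword: str) -> Optional[UIElement]:
    """Stand-in for element discovery: keyword match on the element's function."""
    for element in screen:
        if goal_keyword in element.function:
            return element
    return None


def act_on(element: UIElement) -> Action:
    """Turn an element record into an action at the centre of its bounding box."""
    left, top, right, bottom = element.bbox
    centre = ((left + right) // 2, (top + bottom) // 2)
    return Action(element.usage, centre, f"{element.name}: {element.function}")


def run_loop(screens: Dict[str, Screen], transitions: Dict[str, str],
             start: str, plan: List[str]) -> List[Action]:
    """Cycle Screen -> UI element -> Action until the plan is exhausted."""
    current = start
    trace: List[Action] = []
    for keyword in plan:
        element = find_key_element(screens[current], keyword)
        if element is None:
            break  # no matching element on this screen; stop rather than guess
        trace.append(act_on(element))
        # The action yields a new screen, which closes the loop.
        current = transitions.get(element.name, current)
    return trace


# Toy data, invented purely for illustration.
screens = {
    "home": [UIElement("settings_icon", (0, 0, 40, 40), "opens settings", "click")],
    "settings": [UIElement("wifi_toggle", (100, 200, 160, 240), "toggles wifi", "click")],
}
transitions = {"settings_icon": "settings"}
trace = run_loop(screens, transitions, "home", ["settings", "wifi"])
for step in trace:
    print(step.kind, step.point, step.rationale)
```

Because every action carries the element record that produced it, the trace doubles as an explanation of why each step was taken, which is the kind of interpretability the abstract contrasts with direct screen-to-action prediction.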

Multimodal Models | Reasoning & Chain-of-Thought | Tool Use & Agents

Year: 2026

Venue: N/A
