ZJU | Apr 15, 2026 | arXiv:2604.14113

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Tong-I Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

AI Summary

This paper introduces UI-Zoomer, a training-free adaptive zoom-in framework for GUI grounding that zooms in on a GUI screenshot only when the model's localization is uncertain. A confidence-aware gate decides when to trigger zoom-in by combining spatial consensus among sampled predictions with token-level generation confidence, and an uncertainty-driven crop sizing module determines the zoom scale. Experiments on three GUI grounding benchmarks (ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2) show that UI-Zoomer consistently improves localization accuracy over strong baselines without requiring additional training.

Key Contribution

Uncertainty-driven zoom-in improves GUI grounding accuracy by up to 13.4% on ScreenSpot-Pro, 10.3% on UI-Vision, and 4.2% on ScreenSpot-v2 without any retraining, indicating that zooming in only where the model is uncertain is an effective test-time strategy.

Abstract

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2%, respectively, with no additional training required.
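The two components described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the exact formulas, so every concrete choice here is an assumption made for illustration: using mean pairwise IoU as the spatial consensus, using the exponentiated mean token log-probability as the generation confidence, fusing the two with a weighted sum, treating each predicted box as a uniform distribution over its extent, and the names and defaults `weight`, `threshold`, and `scale`.

```python
import math
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def spatial_consensus(boxes: Sequence[Box]) -> float:
    """Assumed consensus measure: mean pairwise IoU of the sampled boxes."""
    scores = [iou(boxes[i], boxes[j])
              for i in range(len(boxes)) for j in range(i + 1, len(boxes))]
    return fmean(scores) if scores else 1.0


def should_zoom(boxes: Sequence[Box],
                token_logprobs: Sequence[Sequence[float]],
                weight: float = 0.5,        # hypothetical fusion weight
                threshold: float = 0.6) -> bool:  # hypothetical gate threshold
    """Trigger zoom-in when the fused confidence falls below the threshold."""
    consensus = spatial_consensus(boxes)
    # Assumed token confidence: geometric-mean token probability per sample,
    # averaged over samples.
    token_conf = fmean([math.exp(fmean(lp)) for lp in token_logprobs])
    fused = weight * consensus + (1.0 - weight) * token_conf
    return fused < threshold


def crop_radius(boxes: Sequence[Box],
                scale: float = 3.0) -> Tuple[Tuple[float, float], float]:
    """Crop center and radius from a total-variance decomposition.

    Per axis, total variance = variance of box centers (inter-sample spread)
    + mean within-box variance (intra-sample extent). Each box is assumed
    uniform over its extent, so its variance along an axis is side**2 / 12.
    """
    cxs = [(b[0] + b[2]) / 2 for b in boxes]
    cys = [(b[1] + b[3]) / 2 for b in boxes]
    total_x = pvariance(cxs) + fmean([(b[2] - b[0]) ** 2 / 12 for b in boxes])
    total_y = pvariance(cys) + fmean([(b[3] - b[1]) ** 2 / 12 for b in boxes])
    radius = scale * math.sqrt(max(total_x, total_y))
    return (fmean(cxs), fmean(cys)), radius
```

Under these assumptions, samples that agree tightly and are generated with high token probability skip the zoom step, while scattered or low-confidence samples trigger a crop whose radius grows with both the spread of the box centers and the size of the boxes themselves.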

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 38

Year: 2026

Venue: N/A
