Microsoft Research | Apr 14, 2026 | arXiv:2604.13019

See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso, Yu Hu

AI Summary

This paper introduces a multi-turn approach to GUI grounding for Computer Use Agents (CUAs) that iteratively refines cursor localization using visual feedback. By allowing the agent to self-correct displacement errors, the method addresses the challenge of pixel-precise interaction in dense coding interfaces, where single-shot prediction often fails. Experiments across GPT-5.4, Claude, and Qwen on coding benchmarks show that multi-turn refinement significantly improves click precision and task success compared to single-shot baselines.

Key Contribution

Iterative visual refinement lets agents reach pixel-precise targets in dense coding IDEs, outperforming single-shot methods in click precision and task success and pointing toward more reliable software engineering agents.

Abstract

Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
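The closed-loop refinement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `Box`, `refine_click`, and `make_simulated_model` are hypothetical names, and the simulated model stands in for a vision-language model that would inspect a screenshot with the cursor drawn on it. The simulated model is assumed to recover only a fraction (`gain`) of the true offset per turn, which mimics an imperfect single-shot prediction that improves with feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]
# A correction is a (dx, dy) cursor move, or None when the model is satisfied.
Correction = Optional[Tuple[float, float]]


@dataclass
class Box:
    """Axis-aligned bounding box of a target UI element, in pixels."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, p: Point) -> bool:
        return self.left <= p[0] <= self.right and self.top <= p[1] <= self.bottom

    def center(self) -> Point:
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


def refine_click(
    initial_guess: Point,
    propose_correction: Callable[[Point], Correction],
    max_turns: int = 5,
) -> Tuple[Point, List[Point]]:
    """Move the cursor, observe the result, and adjust until the model stops."""
    cursor = initial_guess
    trajectory = [cursor]
    for _ in range(max_turns):
        correction = propose_correction(cursor)  # "look" at the current cursor
        if correction is None:  # model judges the cursor to be on target
            break
        cursor = (cursor[0] + correction[0], cursor[1] + correction[1])
        trajectory.append(cursor)
    return cursor, trajectory


def make_simulated_model(target: Box, gain: float = 0.7) -> Callable[[Point], Correction]:
    """Hypothetical stand-in for a model: estimates only `gain` of the true offset."""
    def propose(cursor: Point) -> Correction:
        if target.contains(cursor):
            return None
        cx, cy = target.center()
        return (gain * (cx - cursor[0]), gain * (cy - cursor[1]))
    return propose


if __name__ == "__main__":
    target = Box(left=100, top=200, right=110, bottom=208)  # a small IDE element
    final, trajectory = refine_click((140.0, 230.0), make_simulated_model(target))
    for turn, point in enumerate(trajectory):
        print(turn, point, "hit" if target.contains(point) else "miss")
```

In this toy setup the first correction alone still lands outside the small target, while a second turn brings the cursor inside it, which is the behavior the multi-turn approach relies on. The `max_turns` cap is a design choice of the sketch to bound the number of model calls.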

Computer Vision | Multimodal Models | Tool Use & Agents

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
