Feb 15, 2026

arXiv:2602.14276

Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

A. Said Gurbuz, Sunghwan Hong, Ahmed Nassar, Peter Staar

AI Summary

The paper introduces ScreenParse, a large-scale dataset for complete screen parsing with dense annotations of all visible UI elements across 771K web screenshots. The authors also present ScreenVLM, a compact vision language model trained on ScreenParse that decodes a ScreenTag markup representation using a structure-aware loss. ScreenVLM substantially outperforms larger foundation VLMs on dense parsing and transfers well to public benchmarks, indicating the value of dense screen supervision for UI understanding.

Key Contribution

Forget sparse annotations: a new dataset and compact VLM show that dense, complete screen parsing supervision yields substantial gains in UI understanding and grounding, including when it is used to finetune larger foundation VLMs.

Abstract

Modern computer-use agents (CUA) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization. Moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.
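
The abstract does not give the exact form of the structure-aware loss, so the following is only a minimal sketch of one way to upweight structure-critical tokens: a weighted token-level negative log-likelihood. The function name, the set of structure token ids, and the weight of 2.0 are all hypothetical and are not taken from the paper.

```python
import math


def weighted_token_nll(log_probs, targets, structure_ids, structure_weight=2.0):
    """Weighted mean negative log-likelihood over one token sequence.

    log_probs: per-position lists of log-probabilities over the vocabulary.
    targets: target token id at each position.
    structure_ids: ids assumed to be structure-critical (hypothetical choice).
    structure_weight: assumed upweighting factor, not a value from the paper.
    """
    total = 0.0
    weight_sum = 0.0
    for position_log_probs, target in zip(log_probs, targets):
        # Structure-critical targets count more than ordinary text tokens.
        weight = structure_weight if target in structure_ids else 1.0
        total += -weight * position_log_probs[target]
        weight_sum += weight
    # Normalize by the total weight so the loss scale stays comparable.
    return total / weight_sum


# Toy example: token 0 stands in for a markup tag, token 1 for a text token.
demo_log_probs = [
    [math.log(0.5), math.log(0.25), math.log(0.25)],
    [math.log(0.5), math.log(0.25), math.log(0.25)],
]
demo_targets = [0, 1]
weighted = weighted_token_nll(demo_log_probs, demo_targets, structure_ids={0})
unweighted = weighted_token_nll(demo_log_probs, demo_targets, structure_ids=set())
```

In this toy case the well-predicted tag token dominates the weighted average, so the weighted loss is lower than the unweighted one; if the tag were the poorly predicted token, the weighting would raise the loss instead, which is the intended pressure toward getting structure right.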

Computer Vision, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
