Microsoft Research · Feb 19, 2026 · arXiv:2602.17365

Computer-Using World Model

Yiming Guan, Rui Yu, John Zhang, Lu Wang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, Pu Zhao, Lukas Wutschitz, Samuel Kessler, Huseyin A Inan, Robert Sim, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

AI Summary

The paper introduces the Computer-Using World Model (CUWM), a world model designed for desktop software environments that predicts the next UI state based on the current state and a candidate action. CUWM factorizes UI dynamics into two stages: predicting textual descriptions of agent-relevant state changes and then synthesizing the next screenshot based on these descriptions. Trained on offline UI transitions from Microsoft Office applications and refined with reinforcement learning, CUWM improves decision quality and execution robustness through test-time action search.

Key Contribution

World models can simulate desktop software environments such as Microsoft Office, letting an agent compare the predicted outcomes of candidate actions before executing one. In test-time action search on Office tasks, this improves decision quality and execution robustness.

Abstract

Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute for computer-using scenarios, where real execution does not support counterfactual exploration, making large-scale trial-and-error learning and planning impractical despite the environment being fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
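The two-stage prediction and the test-time action search described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `UIState`, `predict_next`, `search_action`, and the toy `describe`, `render`, and `score` functions are all hypothetical names, and a string stands in for the screenshot. In CUWM the two stages would be learned models, and how candidates are scored is an assumption here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class UIState:
    # Hypothetical stand-in: a real system would hold screenshot pixels.
    screenshot: str


# Stage 1: (state, action) -> textual description of the state change.
DescribeFn = Callable[[UIState, str], str]
# Stage 2: (state, description) -> synthesized next state.
RenderFn = Callable[[UIState, str], UIState]
# Assumed scorer: (predicted state, goal) -> higher is better.
ScoreFn = Callable[[UIState, str], float]


def predict_next(state: UIState, action: str,
                 describe: DescribeFn, render: RenderFn) -> Tuple[str, UIState]:
    """Two-stage prediction: describe the change in text, then realize it."""
    change = describe(state, action)
    return change, render(state, change)


def search_action(state: UIState, candidates: List[str], goal: str,
                  describe: DescribeFn, render: RenderFn, score: ScoreFn) -> str:
    """Simulate every candidate in the world model and keep the best one."""
    def simulated_score(action: str) -> float:
        _, predicted = predict_next(state, action, describe, render)
        return score(predicted, goal)
    # max() returns the first candidate on ties.
    return max(candidates, key=simulated_score)


# Toy stand-ins so the sketch runs end to end (assumptions, not CUWM models).
def toy_describe(state: UIState, action: str) -> str:
    effects = {
        "click Bold": "selected text becomes bold",
        "click Italic": "selected text becomes italic",
    }
    return effects.get(action, "no visible change")


def toy_render(state: UIState, change: str) -> UIState:
    return UIState(screenshot=f"{state.screenshot} | {change}")


def toy_score(predicted: UIState, goal: str) -> float:
    return 1.0 if goal in predicted.screenshot else 0.0


start = UIState(screenshot="document with selected text")
chosen = search_action(start, ["click Italic", "click Bold", "press Escape"],
                       "bold", toy_describe, toy_render, toy_score)
print(chosen)
```

The agent proposing the candidates stays frozen in this setup; only the simulated outcomes decide which action is actually executed, so a wrong candidate costs a model call rather than a change to the real document.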

Tool Use & Agents · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 53

Year: 2026

Venue: N/A
