This paper introduces ReVision, a method that makes computer-use agents (CUAs) more efficient by using a learned patch selector to drop redundant visual patches from consecutive screenshots, which reduces visual token usage. Applied to the multimodal language model Qwen2.5-VL-7B on trajectories with 5 history screenshots, ReVision cuts token usage by 46% on average while improving success rate by 3% over a baseline that drops no patches, across three benchmarks: OSWorld, WebTailBench, and AgentNetBench. Beyond the efficiency gain, the authors find that CUA performance keeps improving as more past screenshots are included once redundancy is removed.
Cutting visual token usage by 46% while improving success rate by 3% suggests that CUAs can make use of longer screenshot histories within a fixed compute budget.
Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, unlike in other domains, using history has yielded little or no performance improvement for CUAs. We address this inefficiency by introducing ReVision, which trains multimodal language models on trajectories from which redundant visual patches have been removed. A learned patch selector compares patch representations across consecutive screenshots and removes the redundant ones while preserving the spatial structure required by the model. Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that, once redundancy is removed, performance continues to improve as more past observations are incorporated.
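To make the idea of dropping redundant patches across consecutive screenshots concrete, here is a minimal sketch. It is not the authors' method: the abstract describes a learned patch selector, whereas this example substitutes a fixed cosine-similarity threshold as a stand-in. The function names (`select_patches`, `compress_trajectory`), the threshold value, and the toy 2x2 patch grids are all assumptions made for illustration. Each kept patch retains its (row, col) position, which is one simple way to preserve spatial structure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        # Two all-zero patches count as identical; otherwise as different.
        return 1.0 if na == nb else 0.0
    return dot / (na * nb)


def select_patches(prev, curr, threshold=0.95):
    """Keep patches of `curr` that differ from the same position in `prev`.

    Assumption: a fixed similarity threshold stands in for the learned
    selector described in the abstract. Returns (row, col, vector) tuples
    so that each kept patch carries its grid position.
    """
    kept = []
    for r, (prev_row, curr_row) in enumerate(zip(prev, curr)):
        for c, (p, q) in enumerate(zip(prev_row, curr_row)):
            if cosine(p, q) < threshold:
                kept.append((r, c, q))
    return kept


def compress_trajectory(frames, threshold=0.95):
    """Keep the first frame in full, then only changed patches per frame."""
    first = [(r, c, v) for r, row in enumerate(frames[0]) for c, v in enumerate(row)]
    out = [first]
    for prev, curr in zip(frames, frames[1:]):
        out.append(select_patches(prev, curr, threshold))
    return out


# Toy trajectory: two 2x2 "screenshots" whose patches are 2-d feature vectors.
frame0 = [[[1.0, 0.0], [0.0, 1.0]],
          [[1.0, 1.0], [0.5, 0.5]]]
frame1 = [[[1.0, 0.0], [1.0, 0.0]],   # only patch (0, 1) changed
          [[1.0, 1.0], [0.5, 0.5]]]

compressed = compress_trajectory([frame0, frame1])
print([len(f) for f in compressed])  # [4, 1]
```

In this toy case the second screenshot contributes one patch instead of four. A real system would operate on the vision encoder's patch representations, and the selector would be trained rather than thresholded.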