Microsoft Research | Jun 4, 2026 | arXiv:2606.05597

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Haoyue Bai, Ruiqi Yang, Chen Ye, Spencer Whitehead, Aviral Kumar, Tong Zhang

AI Summary

This paper introduces AsyncWebRL, an approach to training vision-language web agents with multi-step reinforcement learning (RL) that targets both system-level and algorithmic inefficiency. An asynchronous design overlaps rollout, gradient updates, and policy refreshes; combined with web-agent-specific adaptations such as an everlasting rollout pool, it achieves up to a 2.9× speedup in end-to-end training throughput over the fastest previous open synchronous pipeline (WebGym). On the algorithmic side, the authors replace the per-trajectory normalizer in multi-step GRPO with a constant, which shortens trajectories while preserving aggregate success. Together, these changes set a new open-source state of the art on the WebGym out-of-distribution test split, with the largest gains on the Medium and Hard slices.

Key Contribution

AsyncWebRL delivers up to a 2.9× speedup in training throughput and sets a new open-source state of the art on the WebGym out-of-distribution test split, with the largest gains on the harder tasks.

Abstract

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that together deliver up to a $2.9\times$ end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer $1/|\tau_i|$ in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing $1/|\tau_i|$ with a constant $1/k$ breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).
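
The length coupling described in the abstract can be illustrated with a small numeric sketch. This is not the authors' implementation: the helper names (`group_advantages`, `per_token_weights`), the toy rewards and lengths, and the choice `k=20` are all assumptions made for illustration, and the abstract does not say how the constant $k$ is chosen. The sketch assumes GRPO-style group-normalized advantages and compares the weight each token receives under the $1/|\tau_i|$ normalizer with the weight it receives under a constant $1/k$.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages (r - mean) / std, in the style of GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same: no learning signal in this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def per_token_weights(rewards, lengths, k=None):
    """Weight applied to every token of each trajectory.

    k=None -> per-trajectory normalizer 1/|tau_i|
    k=c    -> constant normalizer 1/c (hypothetical choice of c)
    """
    advantages = group_advantages(rewards)
    weights = []
    for adv, length in zip(advantages, lengths):
        scale = 1.0 / length if k is None else 1.0 / k
        weights.append(adv * scale)
    return weights


# Toy group: two short successes and two long failures (made-up numbers).
rewards = [1.0, 1.0, 0.0, 0.0]
lengths = [10, 10, 40, 40]

length_norm = per_token_weights(rewards, lengths)        # uses 1/|tau_i|
const_norm = per_token_weights(rewards, lengths, k=20)   # uses 1/k

print("1/|tau_i|:", length_norm)
print("1/k      :", const_norm)
```

In this toy group, tokens in the long failed trajectories receive a smaller per-token weight than tokens in the short successful ones under $1/|\tau_i|$, whereas under a constant $1/k$ every token receives a weight of the same magnitude, so a longer failure contributes more total negative gradient.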

Multimodal Models | Tool Use & Agents | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
