RiskWebWorld, a new interactive benchmark, is introduced to evaluate GUI agents in realistic e-commerce risk management scenarios, featuring 1,513 tasks from production risk-control pipelines. Evaluation of diverse models on RiskWebWorld reveals a significant performance gap, with top-tier generalist models achieving 49.1% success while specialized GUI models fail, suggesting that foundation model scale currently outweighs zero-shot interface grounding. The benchmark's Gymnasium-compliant infrastructure also enables agentic RL, improving open-source models by 16.2%, demonstrating its utility for developing robust digital workers.
Generalist foundation models beat specialized GUI agents at e-commerce risk management, suggesting scale trumps zero-shot grounding for complex, real-world web tasks.
Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, including partial hijacking of the environment. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve 49.1% success, while specialized open-weights GUI models fail almost entirely. This suggests that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers.
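To illustrate what a Gymnasium-compliant interface that decouples policy from environment can look like, here is a minimal sketch. It is not the RiskWebWorld implementation. It mirrors only the standard Gymnasium call shapes, `reset()` returning `(observation, info)` and `step(action)` returning `(observation, reward, terminated, truncated, info)`, and it does so without importing the library. The names `ToyRiskEnv`, `run_episode`, `oracle_policy`, and `noop_policy`, along with the observation fields and action strings, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple

Obs = Dict[str, Any]


@dataclass
class ToyRiskEnv:
    """Hypothetical toy task following Gymnasium's reset/step signatures."""
    target: str = "flag_listing"   # assumed action that completes the task
    max_steps: int = 5             # step budget before the episode is truncated
    _steps: int = field(default=0, init=False)

    def reset(self, seed=None, options=None) -> Tuple[Obs, dict]:
        # Environment mechanics live here; the policy never sees them.
        self._steps = 0
        return {"page": "listing", "hint": self.target}, {}

    def step(self, action: str) -> Tuple[Obs, float, bool, bool, dict]:
        self._steps += 1
        terminated = action == self.target                 # task solved
        truncated = not terminated and self._steps >= self.max_steps
        reward = 1.0 if terminated else 0.0                # sparse success reward
        obs = {"page": "done" if terminated else "listing", "hint": self.target}
        return obs, reward, terminated, truncated, {"steps": self._steps}


def run_episode(env: ToyRiskEnv, policy: Callable[[Obs], str]) -> Tuple[float, int]:
    """Generic rollout loop: works with any policy mapping observation -> action."""
    obs, info = env.reset()
    total = 0.0
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            return total, info["steps"]


def oracle_policy(obs: Obs) -> str:
    # Reads the answer from the observation; stands in for a capable agent.
    return obs["hint"]


def noop_policy(obs: Obs) -> str:
    # Never issues the completing action; stands in for a failing agent.
    return "scroll"
```

Because the rollout loop touches the environment only through `reset` and `step`, the same loop can serve both evaluation (averaging the returned rewards to get a success rate) and RL data collection, and policies can be swapped without changing the environment.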