This paper introduces LabOSBench, a benchmark for evaluating multimodal GUI agents on scientific instrument control, addressing the limitations of existing benchmarks that focus primarily on software operation tasks. Built on a suite of web-based scientific-instrument simulators, LabOSBench allows scalable and safe benchmarking without the high cost and safety risks of physical instruments. The evaluation of general-purpose vision-language models, specialized GUI agent models, and agentic frameworks shows that while agents can complete many structured GUI subtasks, they struggle with feedback-driven operations and long-horizon workflow execution.
Existing agents can complete many structured GUI subtasks but falter in feedback-driven operations and long-horizon workflows, revealing gaps in their capabilities for scientific instrument control.
Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems, whereas scientific instrumentation scenarios require coordinated control over complex interfaces and feedback-driven parameter adjustment. However, directly evaluating agents on physical high-precision instruments is impractical due to high cost, safety risks, limited accessibility, and the difficulty of ensuring reproducible evaluation. This motivates the need for a simulated yet realistic testbed that preserves the operational challenges of scientific instruments while enabling scalable and safe benchmarking. To this end, we introduce LabOSBench, a challenging benchmark for multimodal GUI agents built on a suite of web-based scientific-instrument simulators. Operating directly via a browser, LabOSBench avoids resource-heavy OS virtualization while supporting flexible task configuration and execution-based evaluation. Specifically, LabOSBench comprises 96 subtasks across eight instrument simulators, covering workflows from sample loading, alignment, parameter tuning, and data acquisition to result inspection. We evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both subtask and end-to-end levels. Our experiments reveal that while existing agents can complete many structured GUI subtasks, they still struggle with feedback-driven operations and long-horizon workflow execution. Overall, LabOSBench provides a reproducible, low-cost testbed for advancing computer-using agents toward scientific-instrument control.