Mar 12, 2026 | arXiv:2603.11987

LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories

Qianpu Sun, Qian Sun, Xiaowei Chi, Yuhan Rui, Kuangzhi Ge, Jiajun Li, Sirui Han, Shanghang Zhang

AI Summary

The paper introduces LABSHIELD, a realistic multi-view benchmark for evaluating how well MLLM agents identify hazards and reason about safety in scientific laboratory settings. The benchmark is grounded in OSHA standards and the GHS, and covers 164 operational tasks of varying manipulation complexity and risk profile. Evaluations of 20 proprietary, 9 open-source, and 3 embodied models show an average drop of 32.0% from general-domain MCQ accuracy to Semi-open QA safety performance in professional laboratory scenarios, most notably in hazard interpretation and safety-aware planning. The result points to the need for safety-centric reasoning in autonomous scientific experimentation.

Key Contribution

MLLMs perform markedly worse when asked to reason about safety in laboratory settings: on average, scores fall by 32.0% from general-domain MCQ accuracy to Semi-open QA safety performance, exposing a critical gap for real-world deployment.

Abstract

Artificial intelligence is increasingly catalyzing scientific automation, with multimodal large language model (MLLM) agents evolving from lab assistants into self-driving lab operators. This transition imposes stringent safety requirements on laboratory environments, where fragile glassware, hazardous substances, and high-precision laboratory equipment render planning errors or misinterpreted risks potentially irreversible. However, the safety awareness and decision-making reliability of embodied agents in such high-stakes settings remain insufficiently defined and evaluated. To bridge this gap, we introduce LABSHIELD, a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. Grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS), LABSHIELD establishes a rigorous safety taxonomy spanning 164 operational tasks with diverse manipulation complexities and risk profiles. We evaluate 20 proprietary models, 9 open-source models, and 3 embodied models under a dual-track evaluation framework. Our results reveal a systematic gap between general-domain MCQ accuracy and Semi-open QA safety performance, with models exhibiting an average drop of 32.0% in professional laboratory scenarios, particularly in hazard interpretation and safety-aware planning. These findings underscore the urgent necessity for safety-centric reasoning frameworks to ensure reliable autonomous scientific experimentation in embodied laboratory contexts. The full dataset will be released soon.

Eval Frameworks & Benchmarks, Multimodal Models, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 101

Year: 2026

Venue: N/A
