The paper introduces LABSHIELD, a realistic multi-view benchmark for evaluating the hazard identification and safety-critical reasoning of MLLM agents in scientific laboratory environments. The benchmark is grounded in OSHA and GHS standards and covers 164 operational tasks of varying manipulation complexity and risk profile. Evaluations of 20 proprietary, 9 open-source, and 3 embodied models reveal an average drop of 32.0% from general-domain MCQ accuracy to Semi-open QA safety performance in professional laboratory scenarios, highlighting the need for improved safety-aware planning in autonomous scientific experimentation.
MLLMs stumble badly when asked to reason about safety in lab settings: scores fall by an average of 32% between general-domain MCQ accuracy and Semi-open QA safety performance, revealing a critical gap for real-world deployment.
Artificial intelligence is increasingly catalyzing scientific automation, with multimodal large language model (MLLM) agents evolving from lab assistants into self-driving lab operators. This transition imposes stringent safety requirements on laboratory environments, where fragile glassware, hazardous substances, and high-precision laboratory equipment render planning errors or misinterpreted risks potentially irreversible. However, the safety awareness and decision-making reliability of embodied agents in such high-stakes settings remain insufficiently defined and evaluated. To bridge this gap, we introduce LABSHIELD, a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. Grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS), LABSHIELD establishes a rigorous safety taxonomy spanning 164 operational tasks with diverse manipulation complexities and risk profiles. We evaluate 20 proprietary models, 9 open-source models, and 3 embodied models under a dual-track evaluation framework. Our results reveal a systematic gap between general-domain MCQ accuracy and Semi-open QA safety performance, with models exhibiting an average drop of 32.0% in professional laboratory scenarios, particularly in hazard interpretation and safety-aware planning. These findings underscore the urgent necessity for safety-centric reasoning frameworks to ensure reliable autonomous scientific experimentation in embodied laboratory contexts. The full dataset will be released soon.
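The abstract does not say whether the 32.0% average drop is measured in absolute percentage points or relative to each model's MCQ accuracy. The Python sketch below computes both readings so the difference is concrete. The function name `average_drop` and the score pairs are hypothetical placeholders, not the paper's evaluation code or results.

```python
from statistics import mean


def average_drop(scores):
    """Return (absolute, relative) average drop across models.

    scores: list of (mcq_accuracy, semi_open_score) pairs, both in percent.
    This is an illustrative reading of "average drop", not the paper's
    confirmed definition.
    """
    # Absolute drop: percentage points lost between the two tracks.
    absolute = mean(mcq - qa for mcq, qa in scores)
    # Relative drop: the same loss as a fraction of the MCQ accuracy.
    relative = mean((mcq - qa) / mcq * 100 for mcq, qa in scores)
    return absolute, relative


# Placeholder numbers for three imaginary models, NOT results from the paper.
example_scores = [(80.0, 50.0), (70.0, 35.0), (60.0, 30.0)]
absolute_drop, relative_drop = average_drop(example_scores)
print(f"absolute drop: {absolute_drop:.1f} points")
print(f"relative drop: {relative_drop:.1f}%")
```

The two readings can diverge substantially for the same scores, so the reported figure is best interpreted alongside the metric definitions in the full paper.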