Feb 18, 2026 · arXiv:2602.16703

Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology

Shen Zhou Hong, Alexander Kleinman, Alyssa Mathiowetz, Alex Kleinman, Adam Howes, Julian Cohen, Suveer Ganta, Alex Letizia, Dora Liao, Deepika Pahari, Xavier Roberts-Gaal, Luca Righetti, Joe Torres

AI Summary

This study investigates whether LLM assistance improves novice performance on a viral reverse genetics workflow via a pre-registered, investigator-blinded, randomized controlled trial (n=153) conducted between June-August 2025. The primary endpoint of workflow completion showed no significant difference between the LLM and Internet control arms, though individual tasks showed numerically higher success rates in the LLM arm, particularly for cell culture. Bayesian modeling suggests a modest performance benefit from LLM assistance, indicating that while LLMs don't substantially increase completion rates, they may improve progression through intermediate steps.

Key Contribution

Despite strong performance on biological benchmarks, mid-2025 LLMs offer surprisingly little boost to novices completing complex lab procedures in the real world.

Abstract

Large language models (LLMs) perform strongly on biological benchmarks, raising concerns that they may help novice actors acquire dual-use laboratory skills. Yet, whether this translates to improved human performance in the physical laboratory remains unclear. To address this, we conducted a pre-registered, investigator-blinded, randomized controlled trial (June-August 2025; n = 153) evaluating whether LLMs improve novice performance in tasks that collectively model a viral reverse genetics workflow. We observed no significant difference in the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759), nor in the success rate of individual tasks. However, the LLM arm had numerically higher success rates in four of the five tasks, most notably for the cell culture task (68.8% LLM vs. 55.3% Internet; P = 0.059). Post-hoc Bayesian modeling of pooled data estimates an approximate 1.4-fold increase (95% CrI 0.74-2.62) in success for a "typical" reverse genetics task under LLM assistance. Ordinal regression modeling suggests that participants in the LLM arm were more likely to progress through intermediate steps across all tasks (posterior probability of a positive effect: 81%-96%). Overall, mid-2025 LLMs did not substantially increase novice completion of complex laboratory procedures but were associated with a modest performance benefit. These results reveal a gap between in silico benchmarks and real-world utility, underscoring the need for physical-world validation of AI biosecurity assessments as model capabilities and user proficiency evolve.
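For readers who want to see how a two-arm comparison of completion rates like the one above can be set up, here is a minimal sketch. It rests on several assumptions that are not stated in the abstract: the arm sizes (77 LLM, 76 Internet) and success counts are back-calculated from n = 153 and the reported percentages, and Fisher's exact test is used purely for illustration. The abstract does not say which test produced the reported P values, so the output need not match them. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are arms; columns are (success, failure). The p-value sums the
    hypergeometric probabilities of all tables that are no more likely
    than the observed one, given fixed margins.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x successes in the first arm under the null.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# ASSUMED counts, inferred from the reported percentages (not given in the text):
# workflow completion: 4/77 (LLM) vs 5/76 (Internet)
# cell culture task:   53/77 (LLM) vs 42/76 (Internet)
llm_n, internet_n = 77, 76
workflow = (4, 5)
cell_culture = (53, 42)

for name, (llm_ok, net_ok) in [("workflow", workflow), ("cell culture", cell_culture)]:
    llm_rate = llm_ok / llm_n
    net_rate = net_ok / internet_n
    p = fisher_exact_two_sided(llm_ok, llm_n - llm_ok, net_ok, internet_n - net_ok)
    print(f"{name}: LLM {llm_rate:.1%} vs Internet {net_rate:.1%}, "
          f"risk ratio {llm_rate / net_rate:.2f}, Fisher p = {p:.3f}")
```

The assumed counts reproduce the reported percentages when rounded to one decimal place, which is the only check available from the abstract alone. The pooled 1.4-fold estimate and the ordinal regression results come from Bayesian models that this sketch does not attempt to reproduce.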

Constitutional AI & AI Ethics · Eval Frameworks & Benchmarks · Scientific Discovery & Drug Design · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 60

Year: 2026

Venue: N/A
