Feb 26, 2026 | arXiv:2602.23329

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman E. Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed E. A. Shaaban, Zifan Wang, Seth Donoughe, Julian Michael

AI Summary

This study investigates whether LLMs improve novice performance on biosecurity-relevant tasks relative to internet-only resources, a question that bears on both scientific acceleration and dual-use risk. In a human uplift study spanning eight biosecurity-relevant task sets, novices with LLM access were compared against novices with internet-only access. Novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]) and outperformed internet-only experts on three of the four benchmarks with expert baselines, although standalone LLMs often exceeded LLM-assisted novices.

Key Contribution

LLM access lets novices outperform internet-only experts on three of the four benchmarks with expert baselines, yet LLM-assisted novices often score below standalone LLMs, indicating that users do not fully elicit the models' capabilities.

Abstract

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
