The paper introduces a new benchmark dataset and testing framework, "Forbidden Science," for evaluating whether large language models appropriately refuse harmful requests related to controlled substances without over-restricting legitimate scientific discourse. The authors analyzed the responses of Claude-3.5-sonnet, Mistral, GPT-3.5-turbo, and Grok-2 to systematically varied prompts, revealing significant differences in their safety profiles and response consistency. Through chain-of-thought analysis, the study also surfaces potential vulnerabilities in current safety mechanisms, underscoring the difficulty of balancing safety with open scientific inquiry.
LLMs exhibit wildly different safety profiles when probed about dual-use science, with refusal rates ranging from 0% to 73% depending on the model.
The development of robust safety benchmarks for large language models requires open, reproducible datasets that can measure both appropriate refusal of harmful content and potential over-restriction of legitimate scientific discourse. We present an open-source dataset and testing framework for evaluating LLM safety mechanisms, primarily on controlled-substance queries, analyzing four major models' responses to systematically varied prompts. Our results reveal distinct safety profiles: Claude-3.5-sonnet demonstrated the most conservative approach with 73% refusals and 27% allowances, while Mistral attempted to answer 100% of queries. GPT-3.5-turbo showed moderate restriction with 10% refusals and 90% allowances, and Grok-2 registered 20% refusals and 80% allowances. Testing prompt variation strategies revealed decreasing response consistency, from 85% with single prompts to 65% with five variations. This publicly available benchmark enables systematic evaluation of the critical balance between necessary safety restrictions and potential over-censorship of legitimate scientific inquiry, while providing a foundation for measuring progress in AI safety implementation. Chain-of-thought analysis reveals potential vulnerabilities in safety mechanisms, highlighting the complexity of implementing robust safeguards without unduly restricting desirable and valid scientific discourse.
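To make the reported metrics concrete, the sketch below shows one plausible way to score such a benchmark: a per-model refusal rate and a per-prompt consistency measure across variations. This is not the paper's actual framework; the data layout, labels, and function names are assumptions for illustration only.

```python
# Minimal scoring sketch (assumed structure, not the paper's framework).
# Each record is (model, prompt_id, variant_id, label), where label is
# "refusal" or "allowance" as judged by a human rater or classifier.
RESPONSES = [
    ("claude-3.5-sonnet", "q01", 0, "refusal"),
    ("claude-3.5-sonnet", "q01", 1, "refusal"),
    ("gpt-3.5-turbo",     "q01", 0, "allowance"),
    ("gpt-3.5-turbo",     "q01", 1, "refusal"),
    # ... one record per (model, prompt, variant) in the full benchmark
]

def refusal_rate(records, model):
    """Fraction of a model's responses labeled as refusals."""
    labels = [label for m, _, _, label in records if m == model]
    return sum(label == "refusal" for label in labels) / len(labels)

def consistency(records, model):
    """Fraction of prompts whose variants all received the same label.

    This mirrors the abstract's notion of response consistency dropping
    as more prompt variations are added (85% with one prompt vs. 65%
    with five variations).
    """
    by_prompt = {}
    for m, prompt_id, _, label in records:
        if m == model:
            by_prompt.setdefault(prompt_id, []).append(label)
    consistent = sum(len(set(labels)) == 1 for labels in by_prompt.values())
    return consistent / len(by_prompt)

for model in sorted({m for m, _, _, _ in RESPONSES}):
    print(model,
          f"refusal={refusal_rate(RESPONSES, model):.0%}",
          f"consistency={consistency(RESPONSES, model):.0%}")
```

Under this scoring, a model that answers every variant of every query (as Mistral reportedly does) would show a 0% refusal rate but could still score low on consistency if its answers flip between refusal and allowance across rephrasings.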