Mar 31, 2026arXiv:2603.29608

Learning Diagnostic Reasoning for Decision Support in Toxicology

Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer, Matthias Keicher

AI Summary

The authors introduce DeToxR, a reinforcement learning (RL) framework for improving LLM-based diagnostic reasoning in emergency toxicology. They fine-tune an LLM using Group Relative Policy Optimization (GRPO) with a clinical performance reward based on a multi-label agreement metric, penalizing the model for both omissions and hallucinations of ingested substances. DeToxR significantly outperforms both its unadapted LLM counterpart and supervised baselines, even demonstrating superior performance compared to an expert toxicologist in a clinical validation study.

Key Contribution

An RL-aligned LLM can outperform expert toxicologists in identifying ingested substances from heterogeneous clinical data, suggesting a path to AI-assisted decision-making in high-stakes medical environments.

Abstract

Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g. paramedic scene descriptions and unreliable patient self-reports or known histories), with structured medical data like vital signs. While Large Language Models (LLMs) show potential for processing such heterogeneous inputs, they struggle in this setting, often underperforming simple baselines that rely solely on patient histories. To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology. We design a robust data-fusion engine for multi-label prediction across 14 substance classes based on an LLM finetuned with Group Relative Policy Optimization (GRPO). We optimize the model's reasoning directly using a clinical performance reward. By formulating a multi-label agreement metric as the reward signal, the model is explicitly penalized for missing co-ingested substances and hallucinating absent poisons. Our model significantly outperforms its unadapted base LLM counterpart and supervised baselines. Furthermore, in a clinical validation study, the model indicates a clinical advantage by outperforming an expert toxicologist in identifying the correct poisons (Micro-F1: 0.644 vs. 0.473). These results demonstrate the potential of RL-aligned LLMs to synthesize unstructured pre-clinical narratives and structured medical data for decision support in high-stakes environments.

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Learning Diagnostic Reasoning for Decision Support in Toxicology

Related Papers