The paper introduces EVAL, a framework for evaluating and improving the safety of large language models (LLMs) in the diagnosis and management of upper gastrointestinal bleeding (UGIB). EVAL combines similarity-based ranking using Fine-Tuned ColBERT with a reward model trained on human-graded responses, enabling rejection sampling over candidate answers. Fine-Tuned ColBERT achieves high alignment with human grading (ρ = 0.81–0.91), and rejection sampling with the reward model improves accuracy by 8.36%.
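To make the rejection-sampling step concrete, here is a minimal sketch of best-of-n selection with a reward model. The `generate` and `reward_score` callables are hypothetical stand-ins for an LLM sampler and the paper's human-grade-trained reward model; this is an illustration of the general technique, not the authors' implementation.

```python
# Minimal sketch of reward-model rejection sampling (best-of-n).
# `generate` and `reward_score` are hypothetical placeholders, not EVAL's API.
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],            # samples one LLM response
    reward_score: Callable[[str, str], float],  # scores (prompt, response)
    n_samples: int = 8,
) -> str:
    """Draw n_samples candidate answers and keep the one the reward
    model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n_samples)]
    return max(candidates, key=lambda answer: reward_score(prompt, answer))
```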
LLMs in gastroenterology can be made significantly safer: a new framework achieves near-expert alignment with human grading and boosts accuracy by over 8% via rejection sampling.
Large language models (LLMs) generate plausible text responses to medical questions, but inaccurate responses pose significant risks in medical decision-making. Grading LLM outputs to determine the best model or answer is time-consuming and impractical in clinical settings; therefore, we introduce EVAL (Expert-of-Experts Verification and Alignment) to streamline this process and enhance LLM safety for upper gastrointestinal bleeding (UGIB). We evaluated OpenAI’s GPT-3.5/4/4o/o1-preview, Anthropic’s Claude-3-Opus, Meta’s LLaMA-2 (7B/13B/70B), and Mistral AI’s Mixtral (7B) across 27 configurations, including zero-shot baseline, retrieval-augmented generation, and supervised fine-tuning. EVAL uses similarity-based ranking and a reward model trained on human-graded responses for rejection sampling. Among the similarity metrics evaluated, Fine-Tuned ColBERT achieved the highest alignment with human performance across three separate datasets (ρ = 0.81–0.91). The reward model replicated human grading in 87.9% of cases across temperature settings and significantly improved accuracy, by 8.36% overall, through rejection sampling. EVAL offers scalable potential to assess accuracy for high-stakes medical decision-making.
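The alignment statistic reported above is a Spearman rank correlation between a similarity metric's scores and human grades. The short sketch below shows how such a ρ is computed; the arrays are made-up illustrative values, not the paper's data.

```python
# Illustrative Spearman correlation between a similarity metric's scores
# (e.g., Fine-Tuned ColBERT) and human grades. All numbers are hypothetical.
from scipy.stats import spearmanr

human_grades = [4.5, 3.0, 5.0, 2.0, 4.0]        # hypothetical expert grades
metric_scores = [0.82, 0.55, 0.91, 0.40, 0.76]  # hypothetical metric scores

rho, p_value = spearmanr(human_grades, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```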