The paper introduces EVAL, a framework for evaluating and improving the safety of large language models (LLMs) in the diagnosis and management of upper gastrointestinal bleeding (UGIB). EVAL combines similarity-based ranking using Fine-Tuned ColBERT with a reward model trained on human-graded responses, enabling rejection sampling over candidate answers. Fine-Tuned ColBERT achieves high alignment with human grading (ρ = 0.81–0.91), and rejection sampling with the reward model improves accuracy by 8.36%.
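To make the rejection-sampling step concrete, here is a minimal sketch of best-of-n selection with a reward model. The `generate` and `reward_score` callables are hypothetical stand-ins for an LLM sampler and the paper's human-grade-trained reward model; this is an illustration of the general technique, not the authors' implementation.

```python
# Minimal sketch of reward-model rejection sampling (best-of-n).
# `generate` and `reward_score` are hypothetical placeholders, not EVAL's API.
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],            # samples one LLM response
    reward_score: Callable[[str, str], float],  # scores (prompt, response)
    n_samples: int = 8,
) -> str:
    """Draw n_samples candidate answers and keep the one the reward
    model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n_samples)]
    return max(candidates, key=lambda answer: reward_score(prompt, answer))
```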
LLMs in gastroenterology can be made significantly safer: a new framework achieves near-expert alignment with human grading and boosts accuracy by over 8% via rejection sampling.
Large language models (LLMs) generate plausible text responses to medical questions, but inaccurate responses pose significant risks in medical decision-making. Grading LLM outputs to determine the best model or answer is time-consuming and impractical in clinical settings; therefore, we introduce EVAL (Expert-of-Experts Verification and Alignment) to streamline this process and enhance LLM safety for upper gastrointestinal bleeding (UGIB). We evaluated OpenAI’s GPT-3.5/4/4o/o1-preview, Anthropic’s Claude-3-Opus, Meta’s LLaMA-2 (7B/13B/70B), and Mistral AI’s Mixtral (7B) across 27 configurations, including zero-shot baseline, retrieval-augmented generation, and supervised fine-tuning. EVAL uses similarity-based ranking and a reward model trained on human-graded responses for rejection sampling. Among the similarity metrics evaluated, Fine-Tuned ColBERT achieved the highest alignment with human performance across three separate datasets (ρ = 0.81–0.91). The reward model replicated human grading in 87.9% of cases across temperature settings and significantly improved accuracy, by 8.36% overall, through rejection sampling. EVAL offers scalable potential to assess accuracy for high-stakes medical decision-making.
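The alignment statistic reported above is a Spearman rank correlation between a similarity metric's scores and human grades. The short sketch below shows how such a ρ is computed; the arrays are made-up illustrative values, not the paper's data.

```python
# Illustrative Spearman correlation between a similarity metric's scores
# (e.g., Fine-Tuned ColBERT) and human grades. All numbers are hypothetical.
from scipy.stats import spearmanr

human_grades = [4.5, 3.0, 5.0, 2.0, 4.0]        # hypothetical expert grades
metric_scores = [0.82, 0.55, 0.91, 0.40, 0.76]  # hypothetical metric scores

rho, p_value = spearmanr(human_grades, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```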