This paper benchmarks GPT-4 Turbo, Gemini Advanced, and LLaMA 3.1 (70B) on a dataset of 427 multiple-choice questions from the Foreign Medical Graduate Examination (FMGE). GPT-4 Turbo achieved approximately 93% accuracy, outperforming Gemini Advanced and LLaMA 3.1 (70B), which both reached around 87%. Statistical validation using McNemar's test found no significant difference in performance between Gemini Advanced and LLaMA 3.1 (70B).
GPT-4 Turbo tops the Foreign Medical Graduate Examination benchmark at roughly 93% accuracy, clearly ahead of Gemini Advanced and LLaMA 3.1 (70B).
This study evaluates three advanced Large Language Models (LLMs)—GPT-4 Turbo, Gemini Advanced, and Meta’s LLaMA 3.1 (70B)—on their ability to answer multiple-choice questions from the Foreign Medical Graduate Examination (FMGE). Using a curated set of 427 text-based questions from recent exams, the investigation assessed each model’s overall accuracy, consistency, and error distribution, with statistical validation via McNemar’s test. GPT-4 Turbo achieved roughly 93% accuracy, while Gemini Advanced and LLaMA 3.1 (70B) both reached roughly 87%, with no significant difference between them. High agreement among models indicates consistent decision-making and stable interpretative patterns. These findings underscore the potential of LLMs as complementary tools for exam preparation, particularly in resource-limited settings. Moreover, the results support the viability of open-source models—exemplified by Meta’s LLaMA 3.1 (70B)—in terms of cost-effectiveness and adaptability. Future research will explore diverse question formats and the integration of these models into clinical decision support systems to further enhance their role in modern medical education.
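The pairwise comparison the abstract describes—McNemar's test on per-question correctness of two models over the same 427 items—can be sketched as follows. This is a minimal illustration of the statistical method, not the study's code; the correctness vectors and function name are hypothetical, and a chi-square approximation with continuity correction is assumed (the paper does not state whether the exact or approximate variant was used).

```python
from math import erfc, sqrt

def mcnemar_test(correct_a, correct_b):
    """McNemar's test on paired per-question correctness (booleans).

    Only discordant pairs drive the test:
      b = questions model A got right and model B got wrong
      c = questions model B got right and model A got wrong
    Returns (chi-square statistic with continuity correction, p-value, b, c).
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0, b, c  # no discordant pairs: models agree everywhere
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom
    p = erfc(sqrt(stat / 2))
    return stat, p, b, c

# Hypothetical per-question correctness for two models (illustrative data only)
model_a = [True, True, True, True, False, False, True, True]
model_b = [False, False, True, True, True, False, True, True]
stat, p, b, c = mcnemar_test(model_a, model_b)
print(f"b={b}, c={c}, chi2={stat:.3f}, p={p:.3f}")
```

A p-value above the chosen threshold (conventionally 0.05), as reported for the Gemini Advanced vs. LLaMA 3.1 (70B) comparison, means the discordant counts are consistent with the two models having the same underlying accuracy.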