This paper benchmarks GPT-4 Turbo, Gemini Advanced, and LLaMA 3.1 (70B) on a dataset of 427 multiple-choice questions from the Foreign Medical Graduate Examination (FMGE). GPT-4 Turbo achieved approximately 93% accuracy, outperforming Gemini Advanced and LLaMA 3.1 (70B), which both reached around 87%. Statistical validation using McNemar's test found no significant difference in performance between Gemini Advanced and LLaMA 3.1 (70B).
GPT-4 Turbo tops the Foreign Medical Graduate Examination benchmark at roughly 93% accuracy, clearly ahead of Gemini Advanced and LLaMA 3.1 (70B).
This study evaluates three advanced Large Language Models (LLMs)—GPT-4 Turbo, Gemini Advanced, and Meta’s LLaMA 3.1 (70B)—on their ability to answer multiple-choice questions from the Foreign Medical Graduate Examination (FMGE). Using a curated set of 427 text-based questions from recent exams, the investigation assessed each model’s overall accuracy, consistency, and error distribution, with statistical validation via McNemar’s test. GPT-4 Turbo achieved roughly 93% accuracy, while Gemini Advanced and LLaMA 3.1 (70B) both reached roughly 87%, with no significant difference between them. High agreement among models indicates consistent decision-making and stable interpretative patterns. These findings underscore the potential of LLMs as complementary tools for exam preparation, particularly in resource-limited settings. Moreover, the results support the viability of open-source models—exemplified by Meta’s LLaMA 3.1 (70B)—in terms of cost-effectiveness and adaptability. Future research will explore diverse question formats and the integration of these models into clinical decision support systems to further enhance their role in modern medical education.
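The pairwise comparison the abstract describes—McNemar's test on per-question correctness of two models over the same 427 items—can be sketched as follows. This is a minimal illustration of the statistical method, not the study's code; the correctness vectors and function name are hypothetical, and a chi-square approximation with continuity correction is assumed (the paper does not state whether the exact or approximate variant was used).

```python
from math import erfc, sqrt

def mcnemar_test(correct_a, correct_b):
    """McNemar's test on paired per-question correctness (booleans).

    Only discordant pairs drive the test:
      b = questions model A got right and model B got wrong
      c = questions model B got right and model A got wrong
    Returns (chi-square statistic with continuity correction, p-value, b, c).
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0, b, c  # no discordant pairs: models agree everywhere
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom
    p = erfc(sqrt(stat / 2))
    return stat, p, b, c

# Hypothetical per-question correctness for two models (illustrative data only)
model_a = [True, True, True, True, False, False, True, True]
model_b = [False, False, True, True, True, False, True, True]
stat, p, b, c = mcnemar_test(model_a, model_b)
print(f"b={b}, c={c}, chi2={stat:.3f}, p={p:.3f}")
```

A p-value above the chosen threshold (conventionally 0.05), as reported for the Gemini Advanced vs. LLaMA 3.1 (70B) comparison, means the discordant counts are consistent with the two models having the same underlying accuracy.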