Feb 16, 2026arXiv:2602.14564

Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation

Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Tareque Mohmud Chowdhury

AI Summary

This paper benchmarks five LLMs (Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini) on the iCliniq medical QA dataset. It evaluates their zero-shot performance using BLEU and ROUGE metrics, finding that larger models generally perform better, but smaller models like Llama-4-Maverick-17B can offer efficiency trade-offs. The study highlights the increasing potential of LLMs for medical QA, especially in resource-constrained settings.

Key Contribution

Despite the trend of larger models leading in medical QA, Llama-4-Maverick-17B's competitive performance hints at valuable efficiency trade-offs for real-world deployment.

Abstract

Recently, Large Language Models (LLMs) have gained significant traction in medical domain, especially in developing a QA systems to Medical QA systems for enhancing access to healthcare in low-resourced settings. This paper compares five LLMs deployed between April 2024 and August 2025 for medical QA, using the iCliniq dataset, containing 38,000 medical questions and answers of diverse specialties. Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We are using a zero-shot evaluation methodology and using BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning. Our results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks. It is notable that, Llama-4-Maverick-17B exhibited more competitive results, thus highlighting evasion efficiency trade-offs relevant for practical deployment. These findings align with advancements in LLM capabilities toward professional-level medical reasoning and reflect the increasing feasibility of LLM-supported QA systems in the real clinical environments. This benchmark aims to serve as a standardized setting for future study to minimize model size, computational resources and to maximize clinical utility in medical NLP applications.

Eval Frameworks & Benchmarks Natural Language Processing Open-Source Models & Weights

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation

Related Papers