This paper evaluates the performance of GPT-4o, Claude 4, and MedGEMMA on the task of automated Kellgren-Lawrence (KL) grading of knee osteoarthritis from radiographic images. The models were assessed using exact match accuracy, ±1 tolerance accuracy, and macro-averaged precision and recall against a dataset of 100 expert-annotated knee radiographs. GPT-4o achieved the highest performance, with 26% exact match accuracy and 63% ±1 tolerance accuracy, but all models exhibited limitations, particularly in accurately classifying moderate to severe OA.
Despite their promise, even the best-performing multimodal LLM (GPT-4o) achieves only 26% exact match accuracy in grading knee osteoarthritis from radiographs, revealing a significant gap in clinical reliability.
Background. Automated grading of knee osteoarthritis (OA) severity using the Kellgren-Lawrence (KL) scale is a critical task in musculoskeletal radiology. Recently developed multimodal large language models (LLMs) offer the potential to interpret clinical images alongside text, but their performance on fine-grained ordinal classification tasks remains poorly characterized.

Methods. We evaluated three multimodal LLMs: vision-enabled OpenAI GPT-4o, Anthropic Claude 4, and the open-source Google MedGEMMA. Each model was asked to predict KL grades from knee radiographs in a publicly available, expert-annotated dataset¹. Model predictions were compared to ground truth labels using exact match accuracy, ±1 tolerance accuracy (i.e., prediction within one KL grade; Figure 1), and macro-averaged precision and recall. Confusion matrices were also analyzed to examine misclassification trends.

Results. The dataset included 100 radiographic images, equally distributed across KL grades:
• Grade 0: 20 images
• Grade 1: 20 images
• Grade 2: 20 images
• Grade 3: 20 images
• Grade 4: 20 images
GPT-4o demonstrated the best overall performance, with 26% exact match accuracy, 63% ±1 tolerance accuracy, macro precision 0.38, and macro recall 0.26. Claude 4 and MedGEMMA each reached 21% exact match accuracy and 58% ±1 tolerance accuracy, with macro precision/recall of 0.23/0.21 and 0.20/0.21, respectively (Figure 2). All models frequently confused adjacent KL grades, particularly underestimating moderate to severe OA (grades 3–4).

Conclusion. Although GPT-4o outperformed the other models, its accuracy remains insufficient for clinical reliability. These findings reveal that current multimodal LLMs, while promising, still struggle with ordinal radiographic interpretation tasks. Targeted training on medical imaging datasets and improved domain adaptation are necessary to enhance their diagnostic utility.
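The evaluation metrics described in the Methods can be sketched in a few lines. This is an assumed implementation for illustration (the paper does not publish its scoring code); KL grades are taken to be integers 0–4, and per-grade ratios with an empty denominator are counted as 0, a common convention for macro averaging.

```python
def evaluate(y_true, y_pred, grades=range(5)):
    """Compute the four metrics used in the study: exact match accuracy,
    ±1 tolerance accuracy, and macro-averaged precision and recall."""
    n = len(y_true)
    # Exact match: prediction equals the ground-truth KL grade.
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # ±1 tolerance: prediction within one KL grade of ground truth.
    within1 = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / n
    # Macro averaging: treat each KL grade as its own class, compute
    # per-grade precision/recall, then average across grades.
    precisions, recalls = [], []
    for g in grades:
        tp = sum(t == g and p == g for t, p in zip(y_true, y_pred))
        pred_g = sum(p == g for p in y_pred)   # predicted as grade g
        true_g = sum(t == g for t in y_true)   # actually grade g
        precisions.append(tp / pred_g if pred_g else 0.0)
        recalls.append(tp / true_g if true_g else 0.0)
    return {
        "exact_match": exact,
        "within_1": within1,
        "macro_precision": sum(precisions) / len(precisions),
        "macro_recall": sum(recalls) / len(recalls),
    }
```

Note that ±1 tolerance accuracy is a natural secondary metric for an ordinal scale like KL: confusing grade 2 with grade 3 is clinically less severe than confusing grade 0 with grade 4, which plain accuracy cannot distinguish.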