This cohort study (Level II) evaluated the accuracy and relevance of five AI chatbots (GPT-3.5, GPT-4, GPT-4 Omni, Gemini Advanced, and Gemini 1.5) in answering 43 frequently asked questions about total knee replacement (TKR). The study found that GPT-3.5, GPT-4, GPT-4 Omni, and Gemini 1.5 provided highly accurate and relevant responses, while Gemini Advanced performed significantly worse, particularly regarding indications/outcomes and alternatives/variations.
GPT-3.5, GPT-4, GPT-4 Omni, and Gemini 1.5 can provide accurate and relevant information regarding TKR, suggesting potential utility in patient education and decision-making.
Background: Artificial intelligence (AI) chatbots are increasingly used to provide medical information. However, systematic evaluations of their accuracy and reliability in orthopaedic surgery, particularly in total knee replacement (TKR), remain limited.

Purpose: To systematically compare and evaluate the performance of various AI chatbots, focusing on their ability to provide accurate and reliable information related to TKR.

Study Design: Cohort study; Level of evidence, 2.

Methods: A total of 43 clinically relevant TKR-related frequently asked questions (FAQs) were selected based on Google search trends and expert consultation. Questions were categorized into 6 key domains: (1) general/procedure-related information, (2) indications and outcomes, (3) risks and complications, (4) pain and postoperative recovery, (5) specific activities after surgery, and (6) alternatives and variations. Each question was submitted to 5 different chatbot models (GPT-3.5, GPT-4, GPT-4 Omni, Gemini Advanced, and Gemini 1.5) for response generation. Two independent orthopaedic surgeons assessed the chatbots' responses for both accuracy and relevance using a 5-point Likert scale. Responses were anonymized, blinding evaluators to the chatbot identities to prevent bias. Accuracy differences among the chatbot models were analyzed by analysis of variance, and relevance was compared using the Kruskal-Wallis test.

Results: GPT-3.5 (4.8 ± 0.5), GPT-4 (4.9 ± 0.4), GPT-4 Omni (4.9 ± 0.3), and Gemini 1.5 (4.8 ± 0.4) demonstrated high accuracy, whereas Gemini Advanced scored significantly lower (4.1 ± 1.4) (P < .001). However, general/procedure-related information, risks and complications, pain and recovery, and postoperative activities showed no significant differences among chatbots. Gemini Advanced underperformed in indications and outcomes (P = .04) and alternatives and variations (P = .002).
Regarding relevance, all chatbots except Gemini Advanced (36/43; 83.7%) achieved a 100% relevance rate (P < .001).

Conclusion: This study demonstrates that GPT-3.5, GPT-4, GPT-4 Omni, and Gemini 1.5 can provide highly accurate and relevant responses to TKR-related queries, while Gemini Advanced underperforms.
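The analysis described in the Methods (one-way ANOVA on accuracy ratings, Kruskal-Wallis on the ordinal relevance scores) can be sketched as below. This is a minimal illustration, not the study's code: the Likert-scale ratings are synthetic values generated to loosely resemble the reported means, and all variable names are hypothetical.

```python
# Sketch of the abstract's statistical comparison, assuming standard
# scipy.stats tests. The 5-point Likert ratings below are SYNTHETIC
# (43 simulated ratings per chatbot), not the study's actual data.
import numpy as np
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(0)

# Simulated accuracy ratings, roughly matching the reported means
# (~4.8-4.9 for four models vs ~4.1 for Gemini Advanced).
gpt35      = rng.choice([4, 5], size=43, p=[0.2, 0.8])
gpt4       = rng.choice([4, 5], size=43, p=[0.1, 0.9])
gpt4_omni  = rng.choice([4, 5], size=43, p=[0.1, 0.9])
gemini_15  = rng.choice([4, 5], size=43, p=[0.2, 0.8])
gemini_adv = rng.choice([1, 3, 4, 5], size=43, p=[0.1, 0.1, 0.2, 0.6])

# Accuracy: one-way analysis of variance across the five chatbots.
f_stat, p_anova = f_oneway(gpt35, gpt4, gpt4_omni, gemini_15, gemini_adv)

# Relevance is ordinal, so a rank-based Kruskal-Wallis H test is used.
h_stat, p_kw = kruskal(gpt35, gpt4, gpt4_omni, gemini_15, gemini_adv)

print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4g}")
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_kw:.4g}")
```

A rank-based test is the usual choice for Likert data because the scale is ordinal; the abstract's use of ANOVA for accuracy treats the ratings as approximately interval-scaled.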