This cohort study (Level II) evaluated the accuracy and relevance of five AI chatbots (GPT-3.5, GPT-4, GPT-4 Omni, Gemini Advanced, and Gemini 1.5) in answering 43 frequently asked questions about total knee replacement (TKR). The study found that GPT-3.5, GPT-4, GPT-4 Omni, and Gemini 1.5 provided highly accurate and relevant responses, while Gemini Advanced performed significantly worse, particularly regarding indications/outcomes and alternatives/variations.
GPT-3.5, GPT-4, GPT-4 Omni, and Gemini 1.5 can provide accurate and relevant information regarding TKR, suggesting potential utility in patient education and decision-making.
Background: Artificial intelligence (AI) chatbots are increasingly used to provide medical information. However, systematic evaluations of their accuracy and reliability in orthopaedic surgery, particularly in total knee replacement (TKR), remain limited.

Purpose: To systematically compare and evaluate the performance of various AI chatbots, focusing on their ability to provide accurate and reliable information related to TKR.

Study Design: Cohort study; Level of evidence, 2.

Methods: A total of 43 clinically relevant TKR-related frequently asked questions (FAQs) were selected based on Google search trends and expert consultation. Questions were categorized into 6 key domains: (1) general/procedure-related information, (2) indications and outcomes, (3) risks and complications, (4) pain and postoperative recovery, (5) specific activities after surgery, and (6) alternatives and variations. Each question was submitted to 5 different chatbot models (GPT-3.5, GPT-4, GPT-4 Omni, Gemini Advanced, and Gemini 1.5) for response generation. Two independent orthopaedic surgeons assessed the chatbots' responses for both accuracy and relevance using a 5-point Likert scale. Responses were anonymized, blinding evaluators to the chatbot identities to prevent bias. Accuracy differences among the chatbot models were analyzed by analysis of variance, and relevance was compared using the Kruskal-Wallis test.

Results: GPT-3.5 (4.8 ± 0.5), GPT-4 (4.9 ± 0.4), GPT-4 Omni (4.9 ± 0.3), and Gemini 1.5 (4.8 ± 0.4) demonstrated high accuracy, whereas Gemini Advanced scored significantly lower (4.1 ± 1.4) (P < .001). However, general/procedure-related information, risks and complications, pain and recovery, and postoperative activities showed no significant differences among chatbots. Gemini Advanced underperformed in indications and outcomes (P = .04) and alternatives and variations (P = .002).
Regarding relevance, all chatbots except Gemini Advanced (36/43; 83.7%) achieved a 100% relevance rate (P < .001).

Conclusion: This study demonstrates that GPT-3.5, GPT-4, GPT-4 Omni, and Gemini 1.5 can provide highly accurate and relevant responses to TKR-related queries, while Gemini Advanced underperforms.
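The analysis described in the Methods (one-way ANOVA on accuracy ratings, Kruskal-Wallis on the ordinal relevance scores) can be sketched as below. This is a minimal illustration, not the study's code: the Likert-scale ratings are synthetic values generated to loosely resemble the reported means, and all variable names are hypothetical.

```python
# Sketch of the abstract's statistical comparison, assuming standard
# scipy.stats tests. The 5-point Likert ratings below are SYNTHETIC
# (43 simulated ratings per chatbot), not the study's actual data.
import numpy as np
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(0)

# Simulated accuracy ratings, roughly matching the reported means
# (~4.8-4.9 for four models vs ~4.1 for Gemini Advanced).
gpt35      = rng.choice([4, 5], size=43, p=[0.2, 0.8])
gpt4       = rng.choice([4, 5], size=43, p=[0.1, 0.9])
gpt4_omni  = rng.choice([4, 5], size=43, p=[0.1, 0.9])
gemini_15  = rng.choice([4, 5], size=43, p=[0.2, 0.8])
gemini_adv = rng.choice([1, 3, 4, 5], size=43, p=[0.1, 0.1, 0.2, 0.6])

# Accuracy: one-way analysis of variance across the five chatbots.
f_stat, p_anova = f_oneway(gpt35, gpt4, gpt4_omni, gemini_15, gemini_adv)

# Relevance is ordinal, so a rank-based Kruskal-Wallis H test is used.
h_stat, p_kw = kruskal(gpt35, gpt4, gpt4_omni, gemini_15, gemini_adv)

print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4g}")
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_kw:.4g}")
```

A rank-based test is the usual choice for Likert data because the scale is ordinal; the abstract's use of ANOVA for accuracy treats the ratings as approximately interval-scaled.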