Mar 5, 2025

Artificial intelligence chatbots in transfusion medicine: A cross-sectional study.

Prateek Srivastava, Ashish Tewari, A. Al‐Riyami

AI Summary

The study evaluated the accuracy, correctness, completeness, and safety of six AI chatbots (ChatGPT 4, ChatGPT 4-o, Gemini Advanced, Copilot, Anthropic Claude 3.5 Sonnet, Meta AI) in responding to transfusion medicine (TM)-related prompts at two time points, 30 days apart. Four TM experts assessed the responses and found that every chatbot showed some inconsistency and some degree of change over the 30 days; none gave entirely correct, complete, or safe answers to all questions. ChatGPT 4-o and Anthropic Claude 3.5 Sonnet demonstrated the highest accuracy and consistency, suggesting potential for future integration into TM practice, provided their output is validated by experts.

Key Contribution

Don't rely on AI chatbots for transfusion medicine advice just yet: even the best-performing models, ChatGPT 4-o and Claude 3.5 Sonnet, did not give fully reliable, complete, or safe answers to every prompt.

Abstract

BACKGROUND AND OBJECTIVES: The recent rise of artificial intelligence (AI) chatbots has attracted many users worldwide. However, expert evaluation is essential before relying on them for transfusion medicine (TM)-related information. This study aims to evaluate the performance of AI chatbots for accuracy, correctness, completeness and safety.

MATERIALS AND METHODS: Six AI chatbots (ChatGPT 4, ChatGPT 4-o, Gemini Advanced, Copilot, Anthropic Claude 3.5 Sonnet, Meta AI) were tested using TM-related prompts at two time points, 30 days apart. Their responses were assessed by four TM experts. Evaluators' scores underwent inter-rater reliability testing. Responses from Day 30 were compared with those from Day 1 to evaluate consistency and potential evolution over time.

RESULTS: All six chatbots exhibited some level of inconsistency and varying degrees of evolution in their responses over 30 days. None provided entirely correct, complete or safe answers to all questions. Among the chatbots tested, ChatGPT 4-o and Anthropic Claude 3.5 Sonnet demonstrated the highest accuracy and consistency, while Microsoft Copilot and Google Gemini Advanced showed the greatest evolution in their responses. As a limitation, the 30-day period may be too short for a precise assessment of chatbot evolution.

CONCLUSION: At the time of the conduct of this study, none of the AI chatbots provided fully reliable, complete or safe responses to all TM-related prompts. However, ChatGPT 4-o and Anthropic Claude 3.5 Sonnet show the highest promise for future integration into TM practices. Given their variability and ongoing development, AI chatbots should not yet be relied upon as authoritative sources in TM without expert validation.

Topics: Eval Frameworks & Benchmarks, Natural Language Processing, Scientific Discovery & Drug Design

Citation Metrics

Citations: 3

Influential citations: 0

References: 19

Year: 2025

Venue: Vox Sanguinis
