Application of AI Chatbot in Responding to Asynchronous Text-Based Messages From Patients With Cancer: Comparative Study

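Chatbot, Wilcoxon signed-rank test, categorical variable, telemedicine, eHealth, conversation, medicine, ordinal regression, descriptive statistics, test (biology), medical record, family medicine, computer science, medical education, health care, psychology, artificial intelligence, machine learning, statistics, internal medicine, economics, paleontology, biology, economic growth, communication, mathematics, Mann-Whitney U test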

Authors

X. X. Bai, Shiyong Wang, Yuanli Zhao, Ming Fei Feng, Wenbin Ma, Xiaomin Liu

Source

Journal: Journal of Medical Internet Research [JMIR Publications]
Date: 2025-05-21 Volume/issue: 27: e67462-e67462

Identifier

DOI: 10.2196/67462

Abstract

Background: Telemedicine, which incorporates artificial intelligence such as chatbots, offers significant potential for enhancing health care delivery. However, the efficacy of artificial intelligence chatbots compared to human physicians in clinical settings remains underexplored, particularly in complex scenarios involving patients with cancer and asynchronous text-based interactions.

Objective: This study aimed to evaluate the performance of the GPT-4 (OpenAI) chatbot in responding to asynchronous text-based medical messages from patients with cancer by comparing its responses with those of physicians across two clinical scenarios: patient education and medical decision-making.

Methods: We collected 4257 deidentified asynchronous text-based medical consultation records from 17 oncologists across China between January 1, 2020, and March 31, 2024. Each record included patient questions, demographic data, and disease-related details. The records were categorized into two scenarios: patient education (eg, symptom explanations and test interpretations) and medical decision-making (eg, treatment planning). The GPT-4 chatbot was used to simulate physician responses to these records, with each session conducted in a new conversation to avoid cross-session interference. The chatbot responses, along with the original physician responses, were evaluated by a medical review panel (3 oncologists) and a patient panel (20 patients with cancer). The medical panel assessed completeness, accuracy, and safety using a 3-level scale, whereas the patient panel rated completeness, trustworthiness, and empathy on a 5-point ordinal scale. Statistical analyses included chi-square tests for categorical variables and Wilcoxon signed-rank tests for ordinal ratings.

Results: In the patient education scenario (n=2364), the chatbot scored higher than physicians in completeness (n=2301, 97.34% vs n=2213, 93.61% for fully complete responses; P=.002), with no significant differences in accuracy or safety (P>.05). In the medical decision-making scenario (n=1893), the chatbot exhibited lower accuracy (n=1834, 96.88% vs n=1855, 97.99% for fully accurate responses; P<.001) and trustworthiness (n=860, 50.71% vs n=1766, 93.29% rated as “Moderately trustworthy” or higher; P<.001) compared with physicians. Regarding empathy, the medical review panel rated the chatbot as demonstrating higher empathy scores across both scenarios, whereas the patient review panel reached the opposite conclusion, consistently favoring physicians in empathetic communication. Errors in chatbot responses were primarily due to misinterpretations of medical terminology or the lack of updated guidelines, with 3.12% (59/1893) of its responses potentially leading to adverse outcomes, compared with 2.01% (38/1893) for physicians.

Conclusions: The GPT-4 chatbot performs comparably to physicians in patient education by providing comprehensive and empathetic responses. However, its reliability in medical decision-making remains limited, particularly in complex scenarios requiring nuanced clinical judgment. These findings underscore the chatbot’s potential as a supplementary tool in telemedicine while highlighting the need for physician oversight to ensure patient safety and accuracy.
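
As a reading aid, the short Python sketch below recomputes several of the percentages reported in the abstract from their stated counts. It is not the authors' analysis code; the names `REPORTED`, `percent`, and `check` are hypothetical and introduced only for illustration. The patient-panel trustworthiness figures are left out because the abstract does not state the denominator used for those ratings.

```python
# Counts and percentages copied from the abstract:
# (label, numerator, denominator, percentage as reported)
REPORTED = [
    ("education: chatbot fully complete", 2301, 2364, 97.34),
    ("education: physician fully complete", 2213, 2364, 93.61),
    ("decision-making: chatbot fully accurate", 1834, 1893, 96.88),
    ("decision-making: physician fully accurate", 1855, 1893, 97.99),
    ("decision-making: chatbot potentially adverse", 59, 1893, 3.12),
    ("decision-making: physician potentially adverse", 38, 1893, 2.01),
]


def percent(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to 2 decimals."""
    return round(100 * numerator / denominator, 2)


def check(rows, tolerance=0.005):
    """Pair each reported percentage with the recomputed one.

    Returns a list of (label, recomputed, reported, matches) tuples, where
    `matches` is True when the two values agree within `tolerance`.
    """
    results = []
    for label, numerator, denominator, reported in rows:
        recomputed = percent(numerator, denominator)
        matches = abs(recomputed - reported) <= tolerance
        results.append((label, recomputed, reported, matches))
    return results


if __name__ == "__main__":
    for label, recomputed, reported, matches in check(REPORTED):
        status = "ok" if matches else "MISMATCH"
        print(f"{label}: recomputed {recomputed}% vs reported {reported}% [{status}]")
    # The two scenario sizes should add up to the total number of records.
    print("total records:", 2364 + 1893)
```

Running the script prints one line per figure, followed by the sum of the two scenario sizes, which can be compared against the 4257 records stated in the Methods.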
