Evaluating the Performance of State-of-the-Art Artificial Intelligence Chatbots Based on the WHO Global Guidelines for the Prevention of Surgical Site Infection: Cross-Sectional Study

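Keywords: Guideline; Multidisciplinary approach; Likert scale; Health care; Inter-rater reliability; Intervention (counseling); Psychology; Medicine; Artificial intelligence; Family medicine; Computer science; Nursing; Political science; Pathology; Rating scale; Law; Developmental psychology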

Authors

Tianyi Wang, Ruiyuan Chen, B.C.M. Wang, Congying Zou, Ning Fan, Shuo Yuan, Aobo Wang, Yu Xi, Lei Zang

Source

Journal: Journal of Medical Internet Research [JMIR Publications]
Date: 2025-07-31 Volume/issue: 27: e75567-e75567 Citations: 3

Identifier

DOI: 10.2196/75567

Abstract

Background: Surgical site infection (SSI) is the most prevalent type of health care–associated infection and leads to increased morbidity and mortality and a significant economic burden. Effective prevention of SSI relies on surgeons strictly following the latest clinical guidelines and implementing standardized, multilevel intervention strategies. However, because clinical guidelines are updated frequently, acquiring and interpreting them is time-consuming and intricate. The emergence of artificial intelligence (AI) chatbots offers both possibilities and challenges for addressing these issues in the surgical field.

Objective: This study aimed to test the multidimensional capability of state-of-the-art AI chatbots to generate proper recommendations and corresponding rationales concordant with the global guidelines for the prevention of SSI.

Methods: Recommendations and corresponding rationales from the 2018 World Health Organization global guidelines that are also referenced by other authoritative guidelines were refined and selected as benchmarks. They were then rephrased into a combined format of closed-ended queries for recommendations and open-ended queries for the corresponding rationales, and each query was entered 3 times into ChatGPT-4o (OpenAI), OpenAI-o1 (OpenAI), Claude 3.5 Sonnet (Anthropic), and Gemini 1.5 Pro (Google). All responses were individually evaluated on 10 metrics based on the QUEST dimensions by 4 multidisciplinary senior surgeons using a 5-point Likert scale. The multidimensional performance of the chatbots was compared, and interrater agreement was calculated.

Results: A total of 300 responses to 25 queries were generated by the 4 chatbots. Interrater agreement among the evaluators ranged from moderate to good (0.54-0.87). For responses to recommendations, the average accuracy, consistency, and harm scores across all chatbots were 4.03 (SD 1.09), 4.07 (SD 0.88), and 4.29 (SD 1.01), respectively. For responses on rationales, 4 subdimensions had mean scores of ≥4: harm (mean 4.22, SD 0.97), relevance (mean 4.15, SD 0.83), fabrication and falsification (mean 4.12, SD 1.02), and understanding and reasoning (mean 4.04, SD 0.92). In contrast, consistency (mean 3.94, SD 0.72), clarity (mean 3.94, SD 0.89), comprehensiveness (mean 3.85, SD 0.83), and accuracy (mean 3.74, SD 0.91) were rated at a moderate level. For the responses as a whole, the average self-awareness and trust and confidence scores across all chatbots were 3.84 (SD 0.89) and 3.88 (SD 0.91), respectively. Based on the average scores of the subdimensions, Claude 3.5 Sonnet and ChatGPT-4o were the 2 best-performing models.

Conclusions: The performance of AI chatbots in providing responses regarding well-established global guidelines for the prevention of SSI was acceptable, demonstrating considerable potential for clinical application. Nonetheless, the stability of chatbots must be improved, because inaccurate responses can lead to severe consequences in the context of SSI. Despite these limitations, AI is anticipated to trigger far-reaching changes in how clinicians access and use medical information.
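The response count reported in the Results follows from the design in the Methods (25 queries, 4 chatbots, 3 repetitions each). The following is a minimal Python sketch of that arithmetic and of a mean (SD) summary of Likert ratings. It is not the authors' analysis code: the function names and the example ratings are hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not state which SD formula was used.

```python
from itertools import product
from statistics import fmean, stdev

# Chatbots named in the abstract.
CHATBOTS = ["ChatGPT-4o", "OpenAI-o1", "Claude 3.5 Sonnet", "Gemini 1.5 Pro"]
N_QUERIES = 25   # queries derived from the guideline benchmarks
N_REPEATS = 3    # each query was entered 3 times per chatbot


def enumerate_responses():
    """List every (chatbot, query, repetition) combination in the design."""
    return list(product(CHATBOTS,
                        range(1, N_QUERIES + 1),
                        range(1, N_REPEATS + 1)))


def summarize(ratings):
    """Return (mean, SD) of 5-point Likert ratings, rounded to 2 decimals.

    Assumption: sample SD (n - 1 denominator); the abstract does not specify.
    """
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("Likert ratings must lie between 1 and 5")
    return round(fmean(ratings), 2), round(stdev(ratings), 2)


if __name__ == "__main__":
    print(len(enumerate_responses()))  # 25 * 4 * 3 = 300

    # Hypothetical ratings from 4 raters for one response (not study data).
    print(summarize([5, 4, 4, 3]))
```

The ratings passed to `summarize` are invented for illustration; the scores reported in the abstract come from the study's own rater data, which is not reproduced here.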
