Authors
Matthew L. Magruder, Ariel N. Rodriguez, Jason Wong, Orry Erez, Nicolás S. Piuzzi, Gil R. Scuderi, James Slover, Jason H. Oh, Ran Schwarzkopf, Antonia F. Chen, Richard Iorio, Stuart B. Goodman, Michael A. Mont
Identifier
DOI: 10.1016/j.arth.2024.02.023
Abstract
Introduction: Artificial intelligence (AI) in the field of orthopaedics has been a topic of increasing interest and opportunity in recent years. Its applications are widespread for both physicians and patients, including clinical decision-making, use in the operating room, and research. In this study, we aimed to assess the quality of ChatGPT answers to questions related to total knee arthroplasty (TKA).

Methods: ChatGPT prompts were created by turning 15 of the American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines into questions. An online survey was created that included screenshots of each prompt and the answers to the 15 questions. Surgeons were asked to grade the ChatGPT answers from 1 to 5 on six characteristics: 1) relevance; 2) accuracy; 3) clarity; 4) completeness; 5) evidence base; and 6) consistency. Eleven adult joint reconstruction fellowship-trained surgeons completed the survey. Questions were subclassified by the subject of the prompt: 1) risk factors; 2) implant/intraoperative; and 3) pain/functional outcomes. The average and standard deviation were calculated for all answers as well as for each subgroup, and inter-rater reliability (IRR) was also calculated.

Results: All answer characteristics were graded as above average (i.e., a score > 3). Relevance received the highest scores (4.43 ± 0.77) and consistency the lowest (3.54 ± 1.10). ChatGPT prompts in the risk factors group drew the best responses, while those in the pain/functional outcomes group drew the lowest. The overall IRR was 0.33 (poor reliability), with the highest IRR for relevance (0.43) and the lowest for evidence base (0.28).

Conclusion: ChatGPT can answer questions regarding well-established clinical guidelines in TKA with above-average accuracy but demonstrates variable reliability. This investigation is the first step in understanding large language model (LLM) AIs such as ChatGPT and how well they perform in the field of arthroplasty.
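The abstract reports mean ± standard deviation scores and an overall IRR of 0.33 but does not state which IRR statistic was used. Its "poor reliability" reading is consistent with intraclass correlation conventions, so the following is a minimal Python sketch, assuming a two-way random-effects, absolute-agreement, single-rater ICC(2,1) over a 15-question × 11-surgeon rating matrix; the function name icc2_1 and the synthetic scores are illustrative, not taken from the study.

import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings is an (n_items, k_raters) matrix of 1-5 scores.
    This is an assumed choice of IRR statistic; the paper's abstract
    does not specify which one was used.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-question means
    col_means = ratings.mean(axis=0)   # per-surgeon means

    # Mean squares from the two-way ANOVA decomposition.
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_total = np.sum((ratings - grand) ** 2)
    msr = ss_rows / (n - 1)                          # between questions
    msc = ss_cols / (k - 1)                          # between raters
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual

    # Shrout & Fleiss ICC(2,1).
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical data: 15 guideline-based questions rated 1-5 by 11 surgeons.
rng = np.random.default_rng(0)
scores = rng.integers(1, 6, size=(15, 11)).astype(float)
print(f"mean = {scores.mean():.2f} +/- {scores.std(ddof=1):.2f}")
print(f"ICC(2,1) = {icc2_1(scores):.2f}")

Under commonly cited ICC guidelines (below 0.50 poor, 0.50 to 0.75 moderate, 0.75 to 0.90 good), the reported 0.33 falls in the poor band, matching the abstract's interpretation.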