Evaluating Chat Generative Pre-trained Transformer Responses to Common Pediatric In-toeing Questions

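Keywords: Readability; Medicine; Consistency (knowledge base); Rating scale; Health care; Family medicine; Psychology; Computer science; Artificial intelligence; Philosophy; Developmental psychology; Economics; Linguistics; Economic growth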

Authors

Jason Zarahi Amaral,Rebecca J. Schultz,Benjamin M. Martin,Tristen N. Taylor,Basel M. Touban,Jessica A. McGraw-Heinrich,Scott McKay,Scott Rosenfeld,Brian G. Smith

Source

Journal: Journal of Pediatric Orthopaedics [Lippincott Williams & Wilkins]
Date: 2024-04-30; Volume (Issue): 44 (7): e592-e597; Cited by: 7

Links

nih.gov, doi.org

Identifiers

DOI: 10.1097/bpo.0000000000002695

Abstract

Objective: Chat generative pre-trained transformer (ChatGPT) has garnered attention in health care for its potential to reshape patient interactions. As patients increasingly rely on artificial intelligence platforms, concerns about information accuracy arise. In-toeing, a common lower extremity variation, often leads to pediatric orthopaedic referrals despite observation being the primary treatment. Our study aims to assess ChatGPT’s responses to pediatric in-toeing questions, contributing to discussions on health care innovation and technology in patient education.

Methods: We compiled a list of 34 common in-toeing questions from the “Frequently Asked Questions” sections of 9 health care–affiliated websites, identifying 25 as the most encountered. On January 17, 2024, we queried ChatGPT 3.5 in separate sessions and recorded the responses. These 25 questions were posed again on January 21, 2024, to assess its reproducibility. Two pediatric orthopaedic surgeons evaluated responses using a scale of “excellent (no clarification)” to “unsatisfactory (substantial clarification).” Average ratings were used when evaluators’ grades were within one level of each other. In discordant cases, the senior author provided a decisive rating.

Results: We found 46% of ChatGPT responses were “excellent” and 44% “satisfactory (minimal clarification).” In addition, 8% of cases were “satisfactory (moderate clarification)” and 2% were “unsatisfactory.” Questions had appropriate readability, with an average Flesch-Kincaid Grade Level of 4.9 (±2.1). However, ChatGPT’s responses were at a collegiate level, averaging 12.7 (±1.4). No significant differences in ratings were observed between question topics. Furthermore, ChatGPT exhibited moderate consistency after repeated queries, evidenced by a Spearman rho coefficient of 0.55 (P = 0.005). The chatbot appropriately described in-toeing as normal or spontaneously resolving in 62% of responses and consistently recommended evaluation by a health care provider in 100%.

Conclusion: The chatbot presented a serviceable, though not perfect, representation of the diagnosis and management of pediatric in-toeing while demonstrating a moderate level of reproducibility in its responses. ChatGPT’s utility could be enhanced by improving readability and consistency and incorporating evidence-based guidelines.

Level of Evidence: Level IV, diagnostic.
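
The rating reconciliation rule described in the Methods (average the two evaluators' grades when they fall within one level of each other, otherwise defer to the senior author) can be illustrated with a minimal Python sketch. The numeric coding of the four levels (1 to 4), the label strings, and the function name `reconcile` are assumptions made here for illustration; the abstract does not say how the authors encoded or computed the grades.

```python
from typing import Optional

# Hypothetical numeric coding of the four rating levels named in the abstract
# (1 = best). The abstract does not state the numbers the authors used.
SCALE = {
    "excellent": 1,
    "satisfactory_minimal": 2,
    "satisfactory_moderate": 3,
    "unsatisfactory": 4,
}


def reconcile(rater_a: str, rater_b: str, senior: Optional[str] = None) -> float:
    """Combine two evaluators' grades into a single score."""
    a, b = SCALE[rater_a], SCALE[rater_b]
    if abs(a - b) <= 1:
        # Grades match or differ by one level: use their average.
        return (a + b) / 2
    if senior is None:
        raise ValueError("discordant grades require a senior rating")
    # Grades differ by two or more levels: the senior rating decides.
    return float(SCALE[senior])
```

Under this assumed coding, two grades of "excellent" and "satisfactory_minimal" average to 1.5, while a pair such as "excellent" and "unsatisfactory" is resolved entirely by the senior rating.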
