Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination

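Keywords: Orthopedic surgery; Medicine; Subject (documents); Subject; Order (exchange); Internal medicine; Surgery; Artificial intelligence; Psychology; Library science; Computer science; Curriculum; Finance; Pedagogy; Economics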

Authors

Andrew Xu, Manjot Singh, Mariah Balmaceno-Criss, A. Oh, Don Leigh, Mohammad Daher, Daniel Alsoof, Bassel G. Diebo, Alan H. Daniels

Source

Journal: Journal of Orthopaedic Surgery [SAGE]
Date: 2025-01-01 Volume (issue): 33 (1): 10225536241268789 Citations: 5

Identifier

DOI: 10.1177/10225536241268789

Abstract

Background: Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown.

Methods: Three LLMs, OpenAI's GPT-4 and GPT-3.5 and Google Bard, were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions.

Results: GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p < .001 and p = .001, respectively). While GPT-3.5 and Bard did not meet the passing threshold for the exam, GPT-3.5 performed at the level of PGY-1 to PGY-2 (p = .368 and p = .019, respectively) and Bard performed at the level of PGY-1 to PGY-3 (p = .440, p = .498, and p = .036, respectively). GPT-4 outperformed both Bard and GPT-3.5 on image-associated questions (p = .003 and p < .001, respectively) and on higher-order questions (p < .001). Across the 11 subject categories, all models performed similarly regardless of subject matter. When each LLM's performance on higher-order questions was compared with its performance on first-order questions, no significant differences were found (GPT-4 p = .139, GPT-3.5 p = .124, Bard p = .319). Finally, when each model's performance on image-associated questions was compared with its performance on non-image-associated questions, only GPT-3.5 performed significantly worse (p = .045).

Conclusion: The AI-based LLM GPT-4 exhibits a robust ability to correctly answer a diverse range of OITE questions, exceeding the minimum passing score for the 2022 OITE and outperforming its predecessor GPT-3.5 and Google Bard.
