Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments

医学医学物理学自然语言处理计算机科学

作者

Brendin R. Beaulieu‐Jones,Margaret T. Berrigan,Sahaj S. Shah,Jayson S. Marwaha,Shuo-Lun Lai,Gabriel A. Brat

出处

期刊：Surgery [Elsevier BV]
日期：2024-01-21 卷期号：175 (4): 936-942 被引量：18

链接

nih.gov nih.govdoi.org

标识

DOI：10.1016/j.surg.2023.12.014

摘要

Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries.We evaluated the performance of ChatGPT-4 on questions from the Surgical Council on Resident Education question bank and a second commonly used surgical knowledge assessment, referred to as Data-B. Questions were entered in 2 formats: open-ended and multiple-choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat queries.A total of 167 Surgical Council on Resident Education and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% and 67.9% of multiple choice and 47.9% and 66.1% of open-ended questions for Surgical Council on Resident Education and Data-B, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained nonobvious insights. Common reasons for incorrect responses included inaccurate information in a complex question (n = 16, 36.4%), inaccurate information in a fact-based question (n = 11, 25.0%), and accurate information with circumstantial discrepancy (n = 6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly on the first query; the response accuracy changed for 6/16 (37.5%) questions.Consistent with findings in other academic and professional domains, we demonstrate near or above human-level performance of ChatGPT on surgical knowledge questions from 2 widely used question banks. ChatGPT performed better on multiple-choice than open-ended questions, prompting questions regarding its potential for clinical application. Unique to this study, we demonstrate inconsistency in ChatGPT responses on repeat queries. This finding warrants future consideration including efforts at training large language models to provide the safe and consistent responses required for clinical application. Despite near or above human-level performance on question banks and given these observations, it is unclear whether large language models such as ChatGPT are able to safely assist clinicians in providing care.

求助该文献

最长约 10秒，即可获得该文献文件

Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments

今日热心研友