Artificial Intelligence for Anesthesiology Board–Style Examination Questions: Role of Large Language Models

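Medicine; Anesthesiology; Context (archaeology); Consistency (knowledge base); Medical education; Pathology; Artificial intelligence; Computer science; Biology; Paleontology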

Authors

Adnan Khan, Rayaan Yunus, Mahad Sohail, Taha A. Rehman, Shirin Saeed, Yifan Bu, Cullen D. Jackson, Aidan Sharkey, Feroze Mahmood, Robina Matyal

Source

Journal: Journal of Cardiothoracic and Vascular Anesthesia [Elsevier BV]
Date: 2024-02-02; Volume (issue): 38 (5): 1251-1259; Cited by: 7

Links

nih.gov, doi.org

Identifiers

DOI: 10.1053/j.jvca.2024.01.032

Abstract

New artificial intelligence tools have been developed that have implications for medical usage. Large language models (LLMs), such as the widely used ChatGPT developed by OpenAI, have not been explored in the context of anesthesiology education. Understanding the reliability of various publicly available LLMs for medical specialties could offer insight into their understanding of the physiology, pharmacology, and practical applications of anesthesiology. An exploratory prospective review was conducted on questions from a widely used anesthesia board examination review book using 3 commercially available LLMs: OpenAI's ChatGPT GPT-3.5 version (GPT-3.5), OpenAI's ChatGPT GPT-4 (GPT-4), and Google's Bard. Of the 884 eligible questions, the overall correct answer rates were 47.9% for GPT-3.5, 69.4% for GPT-4, and 45.2% for Bard. GPT-4 exhibited significantly higher performance than both GPT-3.5 and Bard (p = 0.001 and p < 0.001, respectively). None of the LLMs met the criteria required to secure American Board of Anesthesiology certification, according to the 70% passing score approximation. GPT-4 significantly outperformed GPT-3.5 and Bard in terms of overall performance, but lacked consistency in providing explanations that aligned with scientific and medical consensus. Although GPT-4 shows promise, current LLMs are not sufficiently advanced to answer anesthesiology board examination questions with passing success. Further iterations and domain-specific training may enhance their utility in medical education.
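
The pass/fail conclusion in the abstract comes down to comparing each reported rate with the 70% approximation. A minimal sketch of that arithmetic follows. It uses only the figures quoted in the abstract; the names `REPORTED_RATES`, `ELIGIBLE_QUESTIONS`, `PASSING_THRESHOLD`, and `summarize` are hypothetical, and the per-model counts are approximations back-calculated from rounded percentages, not the study's raw data. It does not attempt to reproduce the reported p-values, since the abstract does not state which statistical test was used.

```python
# Correct-answer rates as reported in the abstract (percent of eligible questions).
REPORTED_RATES = {"GPT-3.5": 47.9, "GPT-4": 69.4, "Bard": 45.2}
ELIGIBLE_QUESTIONS = 884
PASSING_THRESHOLD = 70.0  # the abstract's approximate passing score, in percent


def summarize(rates, total, threshold):
    """Return, per model, an approximate correct count, the margin to the
    threshold in percentage points, and whether the threshold is met."""
    summary = {}
    for model, rate in rates.items():
        summary[model] = {
            # Approximate only: the abstract gives rounded percentages, not counts.
            "approx_correct": round(rate / 100 * total),
            "margin": round(rate - threshold, 1),
            "passes": rate >= threshold,
        }
    return summary


if __name__ == "__main__":
    results = summarize(REPORTED_RATES, ELIGIBLE_QUESTIONS, PASSING_THRESHOLD)
    for model, row in results.items():
        print(model, row)
```

Under these assumptions, every model falls below the threshold, with GPT-4 closest to it, which matches the abstract's statement that none of the LLMs met the passing approximation.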
