Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions

Authors
Abbas Abolghasemi, Maqsood Ur Rehman, Syed Shakil Ur Rehman
Source
Journal: Cureus [Cureus, Inc.]
Identifier
DOI: 10.7759/cureus.55991
Abstract

Large language models (LLMs) have transformed various domains in medicine, aiding in complex tasks and clinical decision-making, with OpenAI's GPT-4, GPT-3.5, Google's Bard, and Anthropic's Claude among the most widely used. While GPT-4 has demonstrated superior performance in some studies, comprehensive comparisons among these models remain limited. Given the role of the National Board of Medical Examiners (NBME) exams in assessing the clinical knowledge of medical students, this study compares the accuracy of popular LLMs on NBME clinical subject exam sample questions.

Methods: The questions used in this study were multiple-choice questions obtained from the official NBME website and are publicly available. Questions from the NBME subject exams in medicine, pediatrics, obstetrics and gynecology, clinical neurology, ambulatory care, family medicine, psychiatry, and surgery were used to query each LLM. Responses from GPT-4, GPT-3.5, Claude, and Bard were collected in October 2023, compared against the answers provided by the NBME, and checked for accuracy. Statistical analysis was performed using one-way analysis of variance (ANOVA).

Results: Each LLM was queried with a total of 163 questions. GPT-4 scored 163/163 (100%), GPT-3.5 scored 134/163 (82.2%), Bard scored 123/163 (75.5%), and Claude scored 138/163 (84.7%). The total performance of GPT-4 was statistically superior to that of GPT-3.5, Claude, and Bard by 17.8%, 15.3%, and 24.5%, respectively, while the total performance of GPT-3.5, Claude, and Bard did not differ significantly. GPT-4 significantly outperformed Bard in specific subjects, including medicine, pediatrics, family medicine, and ambulatory care, and outperformed GPT-3.5 in ambulatory care and family medicine. Across all LLMs, the surgery exam had the highest average score (18.25/20), while the family medicine exam had the lowest (3.75/5).

Conclusion: GPT-4's superior performance on NBME clinical subject exam sample questions underscores its potential in medical education and practice. While LLMs show promise, discernment in their application is crucial given occasional inaccuracies. As these models continue to advance, regular reassessment and refinement are needed to maintain their reliability and relevance in medicine.
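The analysis the abstract describes — scoring each model's answers against the official NBME key and comparing the models with a one-way ANOVA — is straightforward to reproduce in outline. Below is a minimal sketch, not the authors' code: the per-question score vectors are illustrative placeholders (in the study each would hold 163 entries of 1 = correct, 0 = incorrect), and `scipy.stats.f_oneway` stands in for whatever ANOVA implementation the authors used.

```python
from scipy.stats import f_oneway

# Hypothetical per-question scores: 1 = correct, 0 = incorrect.
# In the study, each vector would have 163 entries, one per NBME
# sample question; these short vectors are placeholders.
results = {
    "GPT-4":   [1, 1, 1, 1, 1, 1, 1, 1],
    "GPT-3.5": [1, 0, 1, 1, 0, 1, 1, 1],
    "Claude":  [1, 1, 0, 1, 1, 0, 1, 1],
    "Bard":    [1, 0, 1, 0, 1, 1, 0, 1],
}

def accuracy(scores):
    """Fraction of questions answered correctly."""
    return sum(scores) / len(scores)

for model, scores in results.items():
    print(f"{model}: {sum(scores)}/{len(scores)} ({accuracy(scores):.1%})")

# One-way ANOVA across the four models' per-question scores,
# mirroring the statistical test named in the abstract.
f_stat, p_value = f_oneway(*results.values())
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

With the placeholder data above, the script prints each model's raw score and accuracy, then the ANOVA F statistic and p-value; a small p-value would indicate that at least one model's accuracy differs from the others, which is the comparison reported in the results.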