
From algorithms to operating room: can large language models master China’s attending anesthesiology exam? a cross-sectional evaluation

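Subspecialty, Medicine, Anesthesiology, Set (abstract data type), Medical education, Machine learning, Computer science, Family medicine, Pathology, Programming language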
Authors
Qiyu He, Zhimin Tan, Niu Wang, Dongxu Chen, Xian Zhang, Feng Qin, Jiuhong Yuan
Source
Journal: International Journal of Surgery [Elsevier]
Identifier
DOI: 10.1097/js9.0000000000003406
Abstract

Objective: The performance of large language models (LLMs) in complex clinical reasoning tasks is not well established. This study compares ChatGPT (GPT-3.5, GPT-4) and DeepSeek (DeepSeek-V3, DeepSeek-R1) in the Chinese anesthesiology attending physician examination (CAAPE), aiming to set AI benchmarks in medical assessments and enhance AI-driven medical education.

Methods: This cross-sectional study assessed four iterations of two major LLMs on the 2025 CAAPE question bank (5,647 questions). Testing employed diverse querying strategies and languages, with subgroup analyses by subspecialty, knowledge type, and question format. The focus was on LLM performance in clinical and logical reasoning tasks, measuring accuracy, error types, and response times.

Results: DeepSeek-R1 (70.6%-73.4%) and GPT-4 (68.6%-70.3%) outperformed DeepSeek-V3 (53.1%-55.5%) and GPT-3.5 (52.2%-55.7%) across all strategies. System role (SR) improved performance, while joint response degraded it. DeepSeek-R1 outperformed GPT-4 in complex subspecialties, reaching peak accuracy (73.4%) under SR combined initial response. GPT models performed better with English than Chinese queries. All models excelled in basic knowledge and Type A1 questions but struggled with clinical scenarios and advanced reasoning. Despite DeepSeek-R1’s stronger performance, its response time was longer. Errors were primarily logical and informational (over 70%), with more than half being high-risk clinical errors.

Conclusion: LLMs show promise in complex clinical reasoning but risk critical errors in high-risk settings. While useful for education and decision support, their error potential must be carefully assessed in high-stakes environments.
