Subspecialties
Medicine
Anesthesiology
Set (abstract data type)
Medical education
Machine learning
Computer science
Family medicine
Pathology
Programming language
Authors
Qiyu He, Zhimin Tan, Niu Wang, Dongxu Chen, Xian Zhang, Feng Qin, Jiuhong Yuan
Identifier
DOI:10.1097/js9.0000000000003406
Abstract
Objective: The performance of large language models (LLMs) in complex clinical reasoning tasks is not well established. This study compares ChatGPT (GPT-3.5, GPT-4) and DeepSeek (DeepSeek-V3, DeepSeek-R1) on the Chinese anesthesiology attending physician examination (CAAPE), aiming to establish AI benchmarks for medical assessments and inform AI-driven medical education.

Methods: This cross-sectional study assessed four iterations of two major LLMs on the 2025 CAAPE question bank (5,647 questions). Testing employed diverse querying strategies and languages, with subgroup analyses by subspecialty, knowledge type, and question format. The focus was on LLM performance in clinical and logical reasoning tasks, measuring accuracy, error types, and response times.

Results: DeepSeek-R1 (70.6%-73.4%) and GPT-4 (68.6%-70.3%) outperformed DeepSeek-V3 (53.1%-55.5%) and GPT-3.5 (52.2%-55.7%) across all strategies. Adding a system role (SR) improved performance, while joint response degraded it. DeepSeek-R1 outperformed GPT-4 in complex subspecialties, reaching peak accuracy (73.4%) under SR combined with initial response. GPT models performed better with English queries than with Chinese ones. All models excelled in basic knowledge and Type A1 questions but struggled with clinical scenarios and advanced reasoning. Despite DeepSeek-R1's stronger performance, its response times were longer. Errors were primarily logical and informational (over 70%), and more than half were high-risk clinical errors.

Conclusion: LLMs show promise in complex clinical reasoning but risk critical errors in high-risk settings. While useful for education and decision support, their error potential must be carefully assessed in high-stakes environments.
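The abstract does not include the authors' evaluation code. The following is a minimal illustrative sketch, assuming an OpenAI-compatible chat API, of how a system-role (SR) querying strategy on multiple-choice exam items might be run and scored for accuracy. The prompt wording, item format, helper names, and sample question are assumptions for illustration only, not the study's actual protocol.

```python
# Minimal sketch (not the study's code): querying an LLM on multiple-choice
# exam items with and without a system role (SR), then scoring accuracy.
# Prompt wording, data format, and the sample item are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_ROLE = (
    "You are an attending anesthesiologist taking a board examination. "
    "Answer each multiple-choice question with a single option letter."
)

def ask(question: str, options: dict[str, str],
        use_system_role: bool, model: str = "gpt-4") -> str:
    """Query the model for one item and return its chosen option letter."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    messages = []
    if use_system_role:
        messages.append({"role": "system", "content": SYSTEM_ROLE})
    messages.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model=model, messages=messages)
    text = reply.choices[0].message.content.strip()
    # Crude parsing: take the first option letter that appears in the reply.
    return next((ch for ch in text if ch in options), "")

def accuracy(items: list[dict], use_system_role: bool) -> float:
    """Fraction of items answered correctly under the chosen strategy."""
    correct = sum(
        ask(it["question"], it["options"], use_system_role) == it["answer"]
        for it in items
    )
    return correct / len(items)

# Hypothetical Type A1-style item (single best answer):
items = [{
    "question": "Which agent is most associated with malignant hyperthermia?",
    "options": {"A": "Propofol", "B": "Succinylcholine",
                "C": "Midazolam", "D": "Fentanyl"},
    "answer": "B",
}]

if __name__ == "__main__":
    print("Accuracy with SR:   ", accuracy(items, use_system_role=True))
    print("Accuracy without SR:", accuracy(items, use_system_role=False))
```

A full evaluation along the lines described in the Methods would iterate over the entire 5,647-item bank, record per-item response times and error types, and compare strategies by subspecialty, knowledge type, and question format; the sketch above only illustrates prompt construction and accuracy scoring.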