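MEDLINE
Task (project management)
Clinical trial
Alternative medicine
Medicine
Equating
Evidence-based medicine
Medical education
Psychology
Randomized controlled trial
Proportion (ratio)
Sample (material)
Knowledge base
Family medicine
Clinical clerkship
Clinical study design
Dependency (UML)
Realism
Precision medicine
Clinical decision-making
English
Readability
Systematic review
Data science
Personalized medicine
Computer science
Cognitive psychology
Sample size determination
Research design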
Authors
Sully F. Chen, Anton Alyakin, Andreas Seas, Eunice Yang, Jinhyuk Choi, Jin Vivian Lee, Amelia L. Chen, Pranav I Warman, Rochelle Bitolas, Robert Steele, Daniel A. Alber, Eric K. Oermann
Source
Journal: Nature Medicine
[Nature Portfolio]
Date: 2026-03-01
Volume (issue): 32 (3): 1152-1159
Citations: 20
Identifier
DOI: 10.1038/s41591-026-04229-5
Abstract
Clinical evaluations of large language models (LLMs) have rapidly expanded since 2022, yet their evidence base remains opaque. The overwhelming volume of studies creates challenges for manual curation and review. However, LLMs themselves offer the scalability and capability to evaluate the ever-growing evidence base. This LLM-assisted review identified 4,609 peer-reviewed studies in clinical medicine between January 2022 and September 2025, equating to roughly 3.2 papers per day. Only 1,048 studies used real-world patient data and of these only 19 were prospective randomized trials; most addressed simulated scenarios (n = 1,857) or exam-style tasks (n = 1,704). ChatGPT and related OpenAI models constitute 65.7% of evaluated models, with Gemini/Bard a distant second constituting 13.1% of evaluated models. Patient-facing communication and education comprised 17% of tasks, followed by knowledge retrieval, and education and assessment simulation. Across 1,046 head-to-head comparisons, LLMs outperformed humans in 33% of comparisons, with a strong dependency on task realism and level of training. At least 25% of studies had sample sizes less than 30. Despite the growth of LLMs in medicine, rigorous, patient-centered evidence remains scarce, underscoring the need for larger prospective trials before clinical adoption.