Question answering
Computer science
Natural language processing
Medicine
Psychology
Information retrieval
Authors
K. K. Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Qazi Mamunur Rashid, Mike Schaekermann, Amy Wang, Dev Dash, Jonathan H. Chen, Nigam H. Shah, Sami Lachgar, P. Mansfield, Sushant Prakash
Source
Journal: Nature Medicine
[Nature Portfolio]
Date: 2025-01-08
Volume (issue), pages: 31 (3): 943-950
Citations: 659
Identifier
DOI:10.1038/s41591-024-03423-7
Abstract
Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a 'passing' score in United States Medical Licensing Examination style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluations framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.