Large Language Models Encode Clinical Knowledge

Topics: Computer science, Benchmark (surveying), Harm, Artificial intelligence, Machine learning, Key (lock), Data science, Language model, Psychology, Computer security, Geodesy, Social psychology, Geography
Authors
Karan Singhal, Shekoofeh Azizi, Tao Tu, Sara Mahdavi, Jason Lee, Hyung Won Chung, Nathan Scales, Ajay Kumar Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry W. Payne, Martin Seneviratne, Paul Gamble, Christopher B. Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Agüera y Arcas, D. R. Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomašev, Yun Liu, Alvin Rajkomar, Joëlle K. Barral, Christopher Semturs, Alan Karthikesalingam, Vivek Natarajan
Source
Venue: Cornell University - arXiv | Cited by: 15
Identifier
DOI:10.48550/arxiv.2212.13138
Abstract

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.
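The abstract describes instruction prompt tuning as a parameter-efficient way to align an LLM to a new domain using only a few exemplars: the base model stays frozen and a small block of continuous ("soft") prompt embeddings is learned and prepended to the input. The following is a minimal sketch of that idea, not the paper's implementation: the toy model, vocabulary size, prompt length, and random (question, answer) tensors are invented stand-ins for Flan-PaLM and the clinician-curated exemplar data, and all hyperparameters are arbitrary.

```python
# Minimal sketch of soft prompt tuning in PyTorch. Illustrative only: the toy
# model and random data below are assumptions, not the paper's setup.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, PROMPT_LEN = 1000, 64, 8

class ToyCausalLM(nn.Module):
    """Stand-in for a large pretrained decoder; its weights stay frozen."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.rnn = nn.GRU(EMBED_DIM, EMBED_DIM, batch_first=True)  # left-to-right, so "causal"
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, input_embeds):
        hidden, _ = self.rnn(input_embeds)
        return self.head(hidden)  # (batch, seq_len, vocab) next-token logits

class PromptTunedLM(nn.Module):
    """Wraps the frozen LM with a learnable soft prompt prepended to every input."""
    def __init__(self, base_lm):
        super().__init__()
        self.base_lm = base_lm
        for p in self.base_lm.parameters():
            p.requires_grad = False  # parameter-efficient: the base model is never updated
        self.soft_prompt = nn.Parameter(torch.randn(PROMPT_LEN, EMBED_DIM) * 0.02)

    def forward(self, token_ids):
        tok_embeds = self.base_lm.embed(token_ids)                          # (B, T, D)
        prompt = self.soft_prompt.unsqueeze(0).expand(token_ids.size(0), -1, -1)
        return self.base_lm(torch.cat([prompt, tok_embeds], dim=1))        # (B, PROMPT_LEN + T, V)

# A few (question, answer) exemplars stand in for instruction data written by
# clinicians; here they are just random token ids for illustration.
questions = torch.randint(0, VOCAB_SIZE, (4, 12))
answers = torch.randint(0, VOCAB_SIZE, (4, 12))

model = PromptTunedLM(ToyCausalLM())
optimizer = torch.optim.Adam([model.soft_prompt], lr=1e-2)  # only the soft prompt is trained
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    seq = torch.cat([questions, answers], dim=1)   # model reads the question, then the answer
    logits = model(seq)                            # (B, PROMPT_LEN + Tq + Ta, V)
    # Next-token prediction: the logits just before each answer position are
    # the ones that should generate the answer tokens.
    ans_start = PROMPT_LEN + questions.size(1)
    pred = logits[:, ans_start - 1 : ans_start - 1 + answers.size(1), :]
    loss = loss_fn(pred.reshape(-1, VOCAB_SIZE), answers.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final toy loss: {loss.item():.3f}")
```

Only the soft prompt receives gradient updates, which is what makes the approach parameter-efficient; in the paper this idea is applied on top of the instruction-tuned Flan-PaLM using a small set of clinician-written exemplars to produce Med-PaLM.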