Evaluating large language models on a highly-specialized topic, radiation oncology physics

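Consistency (knowledge base), Test (biology), Radiation oncology, Benchmark (surveying), Medical education, Psychology, Medical physics, Oncology, Medicine, Internal medicine, Computer science, Biology, Radiation therapy, Artificial intelligence, Ecology, Geodesy, Geography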
Authors
Jason Holmes, Zhengliang Liu, Lian Zhang, Yong Ding, Terence T. Sio, L.A. McGee, Jonathan B. Ashman, Xiang Li, Tianming Liu, Jiajian Shen, Wei Liu
Source
Journal: Frontiers in Oncology [Frontiers Media SA]
Volume: 13 Cited by: 32
Identifier
DOI:10.3389/fonc.2023.1219326
Abstract

We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition to being a valuable benchmark of LLMs.

We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by being asked to explain first, then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach (substituting the correct answer with "None of the above choices is the correct answer."). A majority vote analysis was used to approximate how well each group could score when working together.

ChatGPT (GPT-4) outperformed all other LLMs and medical physicists, on average, with improved accuracy when prompted to explain before answering. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups or Bard (LaMDA). In evaluating deductive reasoning ability, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring based on a majority vote across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote.

This study suggests a great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.
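The majority vote analysis mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the four-question answer key, and the trial data below are all hypothetical and chosen only to show why a group whose errors differ from trial to trial can gain from voting, while a respondent that repeats the same answers cannot.

```python
from collections import Counter


def majority_vote(answers_per_trial):
    """Return the most common answer to each question across trials.

    answers_per_trial is a list of trials; each trial is a list with one
    answer label per question. Ties go to the answer seen first.
    """
    n_questions = len(answers_per_trial[0])
    voted = []
    for q in range(n_questions):
        votes = Counter(trial[q] for trial in answers_per_trial)
        voted.append(votes.most_common(1)[0][0])
    return voted


def score(answers, key):
    """Fraction of answers that match the answer key."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


# Hypothetical toy data (not taken from the study).
answer_key = ["A", "C", "B", "D"]

# Three trials whose mistakes fall on different questions.
varied_trials = [
    ["A", "C", "D", "D"],  # wrong on question 3
    ["A", "B", "B", "D"],  # wrong on question 2
    ["B", "C", "B", "A"],  # wrong on questions 1 and 4
]

# Three trials that repeat the same answers, right or wrong.
consistent_trials = [["A", "C", "D", "D"]] * 3

varied_score = score(majority_vote(varied_trials), answer_key)          # 1.0
consistent_score = score(majority_vote(consistent_trials), answer_key)  # 0.75
```

In this toy case no single varied trial scores above 0.75, yet the vote recovers every correct answer, whereas voting over the identical trials leaves the score unchanged at 0.75. That is the qualitative pattern the abstract reports for the medical physicists and for ChatGPT (GPT-4), respectively.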