Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board–style Examination

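Medicine, Repeatability, Confidence interval, Reliability (semiconductor), Robustness (evolution), Medical physics, Radiology, Statistics, Internal medicine, Mathematics, Biochemistry, Quantum mechanics, Gene, Physics, Power (physics), Chemistry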
Authors
Satheesh Krishna, Nishaant Bhambra, Robert R. Bleakney, Rajesh Bhayana, Sarah Atzen
Source
Journal: Radiology [Radiological Society of North America]
Volume (issue): 311 (2) Citations: 25
Identifier
DOI: 10.1148/radiol.232715
Abstract

Background: ChatGPT (OpenAI) can pass a text-based radiology board–style examination, but its stochasticity and confident language when it is incorrect may limit utility.

Purpose: To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeated prompting with a radiology board–style examination.

Materials and Methods: In this exploratory prospective study, 150 radiology board–style multiple-choice text-based questions, previously used to benchmark ChatGPT, were administered to default versions of ChatGPT (GPT-3.5 and GPT-4) on three separate attempts (separated by ≥1 month and then 1 week). Accuracy and answer choices between attempts were compared to assess reliability (accuracy over time) and repeatability (agreement over time). On the third attempt, regardless of answer choice, ChatGPT was challenged three times with the adversarial prompt, "Your answer choice is incorrect. Please choose a different option," to assess robustness (ability to withstand adversarial prompting). ChatGPT was prompted to rate its confidence from 1–10 (with 10 being the highest level of confidence and 1 being the lowest) on the third attempt and after each challenge prompt.

Results: Neither version showed a difference in accuracy over three attempts: for the first, second, and third attempt, accuracy of GPT-3.5 was 69.3% (104 of 150), 63.3% (95 of 150), and 60.7% (91 of 150), respectively (P = .06); and accuracy of GPT-4 was 80.6% (121 of 150), 78.0% (117 of 150), and 76.7% (115 of 150), respectively (P = .42). Though both GPT-4 and GPT-3.5 had only moderate intrarater agreement (κ = 0.78 and 0.64, respectively), the answer choices of GPT-4 were more consistent across three attempts than those of GPT-3.5 (agreement, 76.7% [115 of 150] vs 61.3% [92 of 150], respectively; P = .006). After challenge prompt, both changed responses for most questions, though GPT-4 did so more frequently than GPT-3.5 (97.3% [146 of 150] vs 71.3% [107 of 150], respectively; P < .001). Both rated "high confidence" (≥8 on the 1–10 scale) for most initial responses (GPT-3.5, 100% [150 of 150]; and GPT-4, 94.0% [141 of 150]) as well as for incorrect responses (ie, overconfidence; GPT-3.5, 100% [59 of 59]; and GPT-4, 77% [27 of 35], respectively; P = .89).

Conclusion: Default GPT-3.5 and GPT-4 were reliably accurate across three attempts, but both had poor repeatability and robustness and were frequently overconfident. GPT-4 was more consistent across attempts than GPT-3.5 but more influenced by an adversarial prompt.

© RSNA, 2024
Supplemental material is available for this article.
See also the editorial by Ballard in this issue.
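
The three count-based measures in the abstract (accuracy per attempt, agreement of answer choices across attempts, and the share of answers changed after a challenge prompt) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the authors' analysis code: the function names and the toy answer lists are hypothetical, and "agreement" is assumed here to mean the same answer choice on all three attempts, which the abstract does not spell out. The κ statistics and P values are not reproduced.

```python
from typing import Sequence


def accuracy(answers: Sequence[str], key: Sequence[str]) -> float:
    """Fraction of questions answered correctly in one attempt."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def full_agreement(attempts: Sequence[Sequence[str]]) -> float:
    """Fraction of questions with an identical answer choice in every attempt.

    Assumption: this is one plausible reading of 'agreement over time'.
    """
    n = len(attempts[0])
    if any(len(a) != n for a in attempts):
        raise ValueError("all attempts must cover the same questions")
    same = sum(len(set(choices)) == 1 for choices in zip(*attempts))
    return same / n


def changed_fraction(before: Sequence[str], after: Sequence[str]) -> float:
    """Fraction of questions whose answer changed after a challenge prompt."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    return sum(b != a for b, a in zip(before, after)) / len(before)


def pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Hypothetical toy data: 4 questions, 3 attempts (not from the study).
key = ["A", "B", "C", "D"]
attempts = [
    ["A", "B", "C", "A"],
    ["A", "B", "D", "A"],
    ["A", "C", "D", "A"],
]
per_attempt_accuracy = [accuracy(a, key) for a in attempts]
agreement = full_agreement(attempts)
changed = changed_fraction(attempts[2], ["B", "C", "D", "B"])

# Counts reported in the abstract, converted to percentages.
gpt4_agreement_pct = pct(115, 150)
gpt35_agreement_pct = pct(92, 150)
gpt4_changed_pct = pct(146, 150)
gpt35_changed_pct = pct(107, 150)
```

With the toy data, the second and third questions receive different choices across attempts, so only half of the questions count as fully consistent even though the first attempt is 75% accurate. This is the sense in which accuracy and repeatability are separate properties.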