Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board–style Examination

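Medicine; Repeatability; Confidence interval; Reliability (semiconductor); Robustness (evolution); Medical physics; Radiology; Statistics; Internal medicine; Power (physics); Physics; Biochemistry; Mathematics; Chemistry; Quantum mechanics; Gene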
Authors
Satheesh Krishna,Nishaant Bhambra,Robert R. Bleakney,Rajesh Bhayana,Sarah Atzen
Source
Journal: Radiology [Radiological Society of North America]
Volume (issue): 311 (2) Cited by: 25
Identifier
DOI:10.1148/radiol.232715
Abstract

Background ChatGPT (OpenAI) can pass a text-based radiology board–style examination, but its stochasticity and confident language when it is incorrect may limit utility.

Purpose To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeated prompting with a radiology board–style examination.

Materials and Methods In this exploratory prospective study, 150 radiology board–style multiple-choice text-based questions, previously used to benchmark ChatGPT, were administered to default versions of ChatGPT (GPT-3.5 and GPT-4) on three separate attempts (separated by ≥1 month and then 1 week). Accuracy and answer choices between attempts were compared to assess reliability (accuracy over time) and repeatability (agreement over time). On the third attempt, regardless of answer choice, ChatGPT was challenged three times with the adversarial prompt, "Your answer choice is incorrect. Please choose a different option," to assess robustness (ability to withstand adversarial prompting). ChatGPT was prompted to rate its confidence from 1–10 (with 10 being the highest level of confidence and 1 being the lowest) on the third attempt and after each challenge prompt.

Results Neither version showed a difference in accuracy over three attempts: for the first, second, and third attempt, accuracy of GPT-3.5 was 69.3% (104 of 150), 63.3% (95 of 150), and 60.7% (91 of 150), respectively (P = .06); and accuracy of GPT-4 was 80.6% (121 of 150), 78.0% (117 of 150), and 76.7% (115 of 150), respectively (P = .42). Though both GPT-4 and GPT-3.5 had only moderate intrarater agreement (κ = 0.78 and 0.64, respectively), the answer choices of GPT-4 were more consistent across three attempts than those of GPT-3.5 (agreement, 76.7% [115 of 150] vs 61.3% [92 of 150], respectively; P = .006). After challenge prompt, both changed responses for most questions, though GPT-4 did so more frequently than GPT-3.5 (97.3% [146 of 150] vs 71.3% [107 of 150], respectively; P < .001). Both rated "high confidence" (≥8 on the 1–10 scale) for most initial responses (GPT-3.5, 100% [150 of 150]; and GPT-4, 94.0% [141 of 150]) as well as for incorrect responses (ie, overconfidence; GPT-3.5, 100% [59 of 59]; and GPT-4, 77% [27 of 35], respectively; P = .89).

Conclusion Default GPT-3.5 and GPT-4 were reliably accurate across three attempts, but both had poor repeatability and robustness and were frequently overconfident. GPT-4 was more consistent across attempts than GPT-3.5 but more influenced by an adversarial prompt.

© RSNA, 2024 Supplemental material is available for this article. See also the editorial by Ballard in this issue.
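
The measures named in the abstract (accuracy per attempt, agreement of answer choices across attempts, response changes after a challenge prompt, and overconfidence on incorrect answers) can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names, the five-question answer key, and all answers and confidence ratings below are hypothetical toy data, and the sketch omits the κ statistic and the significance tests reported in the study.

```python
# Hypothetical sketch of the descriptive measures; all data below are made up.

def accuracy(answers, key):
    """Fraction of questions answered correctly in one attempt."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def agreement(attempts):
    """Fraction of questions given the same answer choice on every attempt."""
    per_question = list(zip(*attempts))
    same = sum(len(set(choices)) == 1 for choices in per_question)
    return same / len(per_question)

def change_rate(before, after):
    """Fraction of questions whose answer changed after a challenge prompt."""
    return sum(b != a for b, a in zip(before, after)) / len(before)

def overconfidence(answers, key, confidence, threshold=8):
    """Fraction of incorrect answers rated at or above the threshold.

    Returns None when there are no incorrect answers.
    """
    wrong = [c for a, k, c in zip(answers, key, confidence) if a != k]
    if not wrong:
        return None
    return sum(c >= threshold for c in wrong) / len(wrong)

# Toy answer key and three attempts on five questions (assumed values).
key = ["A", "C", "B", "D", "A"]
attempts = [
    ["A", "C", "B", "D", "B"],
    ["A", "C", "D", "D", "B"],
    ["A", "B", "B", "D", "B"],
]
after_challenge = ["B", "B", "B", "A", "C"]  # answers after one challenge
confidence = [9, 8, 10, 7, 9]                # self-ratings on attempt 3

accuracies = [accuracy(a, key) for a in attempts]
repeatability = agreement(attempts)
changed = change_rate(attempts[2], after_challenge)
overconfident = overconfidence(attempts[2], key, confidence)

print(accuracies, repeatability, changed, overconfident)
```

With the study's definitions, reliability corresponds to comparing the values in `accuracies` over time, repeatability to `repeatability`, robustness to how low `changed` stays under challenge, and overconfidence to `overconfident` at the ≥8 threshold.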
