Comparing GPT-3.5 and GPT-4 Accuracy and Drift in Radiology Diagnosis Please Cases

Medicine · Medical Physics · Radiology · Nuclear Medicine
Authors
David Li, Kartik Gupta, Mousumi Bhaduri, Paul Sathiadoss, Sahir Bhatnagar, Jaron Chong
Source
Journal: Radiology [Radiological Society of North America]
Volume/Issue: 310 (1) · Cited by: 27
Identifier
DOI: 10.1148/radiol.232411

Original Research · Computer Applications
Radiology, Vol. 310, No. 1. Published online January 16, 2024. https://doi.org/10.1148/radiol.232411

Author affiliations: From the Department of Medical Imaging, London Health Sciences Centre, 800 Commissioners Rd E, London, ON, Canada N6A 5A5 (D.L., M.B., P.S., J.C.); the Department of Medical Imaging, Schulich School of Medicine & Dentistry, Western University, London, Ontario, Canada (K.G., J.C.); and the Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montréal, Quebec, Canada (S.B.). Address correspondence to J.C. (email: [email protected]).

Introduction

Large language models (LLMs), such as generative pretrained transformers (GPTs), have garnered attention in the past year for their remarkable capacity to comprehend and generate human-like text, with perhaps the best known being ChatGPT (1). However, the extent to which advancements in successive GPT generations translate into enhanced diagnostic accuracy on radiology cases remains unquantified. This investigation evaluates the diagnostic accuracy of GPT-3.5 and GPT-4 (OpenAI) in solving text-based Radiology Diagnosis Please cases. GPT-4 is the successor to GPT-3.5 and has demonstrated substantial improvements on numerous academic examinations (2).

Materials and Methods

This study adheres to the Checklist for Artificial Intelligence in Medical Imaging and was exempt from institutional review board review because it used public data (3). A retrospective analysis of Radiology Diagnosis Please cases from August 1998 to July 2023 was performed. The clinical history, imaging findings, and ground truth diagnosis were extracted; cases that disclosed the diagnosis were excluded. The diagnostic accuracy of the March and June 2023 snapshots (ie, specific model versions from a point in time) of GPT-3.5 (4) and GPT-4 (5) was assessed using the top five differential diagnoses generated from text inputs of the history, the findings, and both combined, with imaging findings originally characterized by radiologists. Default hyperparameters were applied, except for a temperature of 0 to maximize determinism. Three radiologists (J.C., P.S., and M.B., with 8, 8, and 23 years of experience, respectively) evaluated the generated differentials, with discrepancies resolved through mediated discussion. A generalized estimating equation (GEE) linear probability model with an exchangeable correlation structure was fit to estimate the time-dependent effects, with 95% CIs, of snapshot version on diagnostic accuracy, adjusted for subspecialty.
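The model-querying step described above (pinned snapshots, temperature 0, top five diagnoses) is simple to reproduce. Below is a minimal sketch using the openai Python package (v1 interface), assuming OpenAI's dated 2023 snapshot identifiers (eg, gpt-4-0314 for March, gpt-4-0613 for June); the prompt wording and helper function are illustrative, not the study's exact prompt.

```python
# Minimal sketch: query a pinned GPT snapshot for a top-5 differential.
# Assumptions: openai Python package (v1 interface) and OpenAI's dated
# 2023 snapshot names; the prompt text is illustrative, not the exact
# prompt used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def top5_differential(model: str, history: str, findings: str) -> str:
    """Ask one pinned model snapshot for a five-item differential diagnosis."""
    prompt = (
        f"Clinical history: {history}\n"
        f"Imaging findings: {findings}\n"
        "List the five most likely diagnoses, most likely first."
    )
    response = client.chat.completions.create(
        model=model,      # eg, "gpt-4-0314" (March) or "gpt-4-0613" (June)
        temperature=0,    # maximize determinism, as in the study
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Replay the same case text against both GPT-4 snapshots to probe drift.
for snapshot in ("gpt-4-0314", "gpt-4-0613"):
    print(snapshot, top5_differential(snapshot, "55-year-old with dyspnea", "..."))
```

Pinning a dated snapshot rather than the floating gpt-4 alias is what makes drift measurable: identical inputs can be replayed against each frozen version.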
Results

Of 315 cases, 28 were excluded because they disclosed the diagnosis, for a final sample of 287 cases. Overall, GPT-4's accuracy improved significantly compared with GPT-3.5, by 19.8 percentage points (95% CI: 15, 25) in March and 11.1 percentage points (95% CI: 6, 17) in June (Tables 1, 2). Within models, GPT-4 showed a statistically significant decrease in accuracy from March to June (−5.92 percentage points [95% CI: −10, −2]), while GPT-3.5 showed an increase that was not statistically significant (+2.79 percentage points [95% CI: −1, 6]). Of the 10 subspecialties, with breast imaging as the reference, the only subspecialty significantly associated with greater accuracy was head and neck (GEE estimate: 0.428 [95% CI: 0.10, 0.76]). Across all subspecialties and snapshots, the average increase in diagnostic accuracy from GPT-3.5 to GPT-4 was +17.3% (SD, 15.6%; minimum, −11.5%; maximum, +50.0%) (Figure).

Table 1: Overall and Per-Subspecialty Diagnostic Accuracy of GPT-3.5 and GPT-4 June 2023 Snapshots on 287 Radiology Diagnosis Please Cases
Table 2: Overall and Per-Subspecialty Diagnostic Accuracy of GPT-3.5 and GPT-4 March 2023 Snapshots on 287 Radiology Diagnosis Please Cases

Figure: Stacked bar charts comparing diagnostic accuracy between the March and June 2023 snapshots of GPT-3.5 and GPT-4 on 287 Radiology Diagnosis Please cases using text-based clinical history and findings. (A) For GPT-3.5, diagnostic accuracy increased in five of 10 subspecialties, remained unchanged in three, and decreased in two between the March and June 2023 snapshots. (B) For GPT-4, diagnostic accuracy increased in two of 10 subspecialties, remained unchanged in two, and decreased in six. BR = breast, CH = chest, CV = cardiovascular, GI = gastrointestinal, GU = genitourinary, HN = head and neck, MSK = musculoskeletal, NR = neuroradiology, OB = obstetric, PD = pediatric.
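The snapshot and model effects above come from the GEE linear probability model described in Materials and Methods, which standard statistical software can fit. Below is a minimal sketch using Python's statsmodels, assuming a hypothetical long-format table (one row per case-model-snapshot response, a binary correct outcome, and case_id marking the repeated-measures cluster); the column names and model formula are illustrative, not the authors' actual code.

```python
# Minimal sketch: GEE linear probability model with an exchangeable
# correlation structure, clustered on case and adjusted for subspecialty.
# Assumptions: hypothetical columns case_id, correct (0/1), model
# ("gpt-3.5"/"gpt-4"), snapshot ("march"/"june"), and subspecialty.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("diagnosis_please_results.csv")  # hypothetical file

fit = smf.gee(
    "correct ~ C(model) * C(snapshot) + C(subspecialty)",
    groups="case_id",                  # responses to one case are correlated
    data=df,
    family=sm.families.Gaussian(),     # linear probability model
    cov_struct=sm.cov_struct.Exchangeable(),
).fit()

print(fit.summary())
print(fit.conf_int())  # 95% CIs for the model and snapshot effects
```

With a Gaussian family on a binary outcome, the coefficients read directly as differences in the probability of a correct diagnosis, ie, the percentage-point effects reported above.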
Discussion

Diagnosis Please cases could serve as a test for gauging performance drift, or changes in model performance over time, because they simulate complex, challenging, real-world clinical scenarios (6). Our results suggest performance drift between the March and June snapshots of GPT-3.5 and GPT-4. The overall increase in diagnostic accuracy from GPT-3.5 to GPT-4 moderately parallels that seen on other academic and professional examinations (2). If future LLMs exhibit similar performance increases, accuracy on Diagnosis Please cases may continue to rise, even without radiology-specific fine-tuning.

Our investigation also yielded an unexpected finding: a statistically significant decrease in the diagnostic accuracy of the GPT-4 June snapshot. This observation echoes other reports of GPT-4's performance varying between snapshots (7). The variability could stem from optimization on competing metrics, such as safety or inference speed, potentially leading to instability in real-world performance. Despite the differences between this experimental setting and clinical practice, LLMs could serve as a decision support tool in future diagnostic workflows, particularly for creatively broadening differential diagnoses under radiologist supervision. Our study highlights the pressing need for robust, continuous LLM monitoring systems before clinical deployment.

Disclosures of conflicts of interest: D.L. No relevant relationships. K.G. No relevant relationships. M.B. No relevant relationships. P.S. No relevant relationships. S.B. No relevant relationships. J.C. Member of the Health Canada Scientific Advisory Committee for Digital Health Technologies; chair of the Canadian Association of Radiologists AI Standing Committee.

Author Contributions

Guarantors of integrity of entire study, K.G., J.C.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, D.L., K.G.; clinical studies, K.G.; experimental studies, D.L., K.G., M.B., P.S., J.C.; statistical analysis, D.L., K.G., S.B.; and manuscript editing, D.L., K.G., J.C.

References

1. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology 2023;307(5):e230582.
2. OpenAI. GPT-4 technical report. arXiv 2303.08774 [preprint]. https://arxiv.org/abs/2303.08774. Posted March 15, 2023. Updated March 27, 2023. Accessed September 2023.
3. Mongan J, Moy L, Kahn CE Jr. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell 2020;2(2):e200029.
4. GPT-3.5. OpenAI. https://platform.openai.com/docs/models/gpt-3-5. Accessed August 12, 2023.
5. GPT-4. OpenAI. https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo. Accessed August 12, 2023.
6. Ueda D, Mitsuyama Y, Takita H, et al. ChatGPT's diagnostic performance from patient history and imaging findings on the Diagnosis Please quizzes. Radiology 2023;308(1):e231040.
7. Chen L, Zaharia M, Zou J. How is ChatGPT's behavior changing over time? arXiv 2307.09009 [preprint]. https://arxiv.org/abs/2307.09009. Posted July 18, 2023. Updated October 31, 2023. Accessed September 2023.

Article History

Received September 8, 2023; revision requested October 6, 2023; revision received November 24, 2023; accepted December 4, 2023; published online January 16, 2024.