Comparing GPT-3.5 and GPT-4 Accuracy and Drift in Radiology Diagnosis Please Cases

David Li, Kartik Gupta, Mousumi Bhaduri, Paul Sathiadoss, Sahir Bhatnagar, Jaron Chong

Author affiliations: Department of Medical Imaging, London Health Sciences Centre, 800 Commissioners Rd E, London, ON, Canada N6A 5A5 (D.L., M.B., P.S., J.C.); Department of Medical Imaging, Schulich School of Medicine & Dentistry, Western University, London, Ontario, Canada (K.G., J.C.); Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montréal, Quebec, Canada (S.B.). Address correspondence to J.C. (email: [email protected]).

Published online January 16, 2024. https://doi.org/10.1148/radiol.232411
Introduction

Large language models (LLMs), such as generative pretrained transformers (GPTs), have garnered attention in the past year due to their remarkable capacity to comprehend and generate human-like text, with perhaps the most well-known being ChatGPT (1). However, it remains unquantified to what extent advancements in successive GPT generations translate into enhanced diagnostic accuracy for radiology cases. This investigation aims to evaluate the diagnostic accuracy of GPT-3.5 and GPT-4 (OpenAI) in solving text-based Radiology Diagnosis Please cases. GPT-4 is the successor to GPT-3.5 and has demonstrated substantial improvements on numerous academic examinations (2).

Materials and Methods

This study adheres to the Checklist for Artificial Intelligence in Medical Imaging and was exempt from institutional review board review due to the use of public data (3). A retrospective analysis of Radiology Diagnosis Please cases from August 1998 to July 2023 was performed. The clinical history, imaging findings, and ground truth diagnosis were extracted; cases disclosing the diagnosis were excluded. Diagnostic accuracy of the March and June 2023 snapshots (ie, specific model versions from a point in time) of GPT-3.5 (4) and GPT-4 (5) was assessed using the top five differential diagnoses generated from text inputs of history, findings, and both combined, with imaging findings originally characterized by radiologists. Default hyperparameters were applied, except for a temperature of 0 to maximize determinism. Three radiologists (J.C., P.S., and M.B., with 8, 8, and 23 years of experience, respectively) evaluated the generated differentials, with discrepancies resolved by means of mediated discussion. A generalized estimating equation (GEE) linear probability model with an exchangeable correlation structure was fit to estimate the time-dependent effects and 95% CIs of snapshot version on diagnostic accuracy, with adjustment for subspecialty.
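For concreteness, the following minimal Python sketch shows how a single case could be submitted to a dated model snapshot at temperature 0 and asked for a top-five differential. The snapshot identifiers are OpenAI's published March and June 2023 chat-model versions; the prompt wording and the query_case helper are illustrative assumptions, not the study's exact implementation.

```python
# Minimal sketch: querying dated GPT snapshots for a top-5 differential.
# The prompt wording and helper function are illustrative assumptions;
# the study's exact prompt is not reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Dated snapshots corresponding to the March and June 2023 versions.
SNAPSHOTS = [
    "gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613",
    "gpt-4-0314", "gpt-4-0613",
]

def query_case(model: str, history: str, findings: str) -> str:
    """Return the model's top-five differential for one case (hypothetical prompt)."""
    prompt = (
        "You are given a radiology case. Provide your top 5 differential "
        f"diagnoses, most likely first.\n\nClinical history: {history}\n"
        f"Imaging findings: {findings}"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # maximize determinism, as in the study
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage for one fabricated case across all four snapshots:
# for m in SNAPSHOTS:
#     print(m, query_case(m, "55-year-old with chronic cough", "Cavitary lung lesion"))
```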
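The statistical model can likewise be sketched: a GEE linear probability model is an identity-link Gaussian GEE with an exchangeable working correlation, clustered on case, so coefficients read as percentage-point differences in accuracy. The sketch below uses statsmodels; the file and column names are assumptions about the analysis data set, not the authors' code.

```python
# Sketch of the GEE linear probability model (assumed file and column names).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Long-format data: one row per case x model x snapshot date, 0/1 correctness.
df = pd.read_csv("diagnosis_please_results.csv")  # hypothetical file

model = smf.gee(
    "correct ~ C(model) * C(snapshot_date) + C(subspecialty)",
    groups="case_id",  # repeated measures within each case
    data=df,
    family=sm.families.Gaussian(),           # identity link -> linear probability model
    cov_struct=sm.cov_struct.Exchangeable(), # exchangeable working correlation
)
result = model.fit()
print(result.summary())  # coefficients are accuracy differences on the 0-1 scale
```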
Results

Of 315 cases, 28 were excluded due to disclosed diagnoses, for a final sample of 287 cases. Overall, GPT-4's accuracy was significantly higher than that of GPT-3.5, by 19.8 percentage points (95% CI: 15, 25) for the March snapshots and 11.1 percentage points (95% CI: 6, 17) for the June snapshots (Tables 1, 2). Within models, GPT-4 showed a statistically significant decrease in accuracy from March to June (−5.92 percentage points [95% CI: −10, −2]), whereas GPT-3.5 showed an increase that was not statistically significant (+2.79 percentage points [95% CI: −1, 6]). Of the 10 subspecialties, with breast imaging as the reference, the only subspecialty significantly associated with greater accuracy was head and neck (GEE estimate: 0.428 [95% CI: 0.10, 0.76]). Across all subspecialties and snapshots, the average increase in diagnostic accuracy from GPT-3.5 to GPT-4 was +17.3% (SD, 15.6%; minimum, −11.5%; maximum, +50.0%) (Figure).

Table 1: Overall and Per-Subspecialty Diagnostic Accuracy of the GPT-3.5 and GPT-4 June 2023 Snapshots on 287 Radiology Diagnosis Please Cases

Table 2: Overall and Per-Subspecialty Diagnostic Accuracy of the GPT-3.5 and GPT-4 March 2023 Snapshots on 287 Radiology Diagnosis Please Cases

Figure: Stacked bar charts comparing diagnostic accuracy between the March and June 2023 snapshots of GPT-3.5 and GPT-4 on 287 Radiology Diagnosis Please cases using text-based clinical history and findings. (A) For GPT-3.5, diagnostic accuracy increased in five of 10 subspecialties, remained unchanged in three, and decreased in two between the March and June 2023 snapshots. (B) For GPT-4, diagnostic accuracy increased in two of 10 subspecialties, remained unchanged in two, and decreased in six. BR = breast, CH = chest, CV = cardiovascular, GI = gastrointestinal, GU = genitourinary, HN = head and neck, MSK = musculoskeletal, NR = neuroradiology, OB = obstetric, PD = pediatric.

Discussion

Diagnosis Please cases could serve as a test for gauging performance drift, or changes in model performance over time, as they simulate complex, challenging, real-world clinical scenarios (6). Our results suggest performance drift between the March and June snapshots of both GPT-3.5 and GPT-4. The overall increase in diagnostic accuracy from GPT-3.5 to GPT-4 moderately parallels that seen on other academic and professional examinations (2). If future LLMs exhibit similar performance increases, accuracy on Diagnosis Please cases may continue to rise, even without radiology-specific fine-tuning.

Our investigation yielded unexpected findings, notably a statistically significant decrease in the diagnostic accuracy of the GPT-4 June snapshot. This observation echoes other reports of GPT-4's performance varying between snapshots (7). This variability could stem from optimization on competing objectives, such as safety or inference speed, potentially leading to instability in real-world performance. Despite differences between this experimental setting and clinical practice, LLMs could serve as a decision support tool in future diagnostic workflows, particularly for creatively broadening differential diagnoses under the supervision of radiologists. Our study highlights the pressing need for more robust and continuous LLM monitoring systems before clinical deployment.
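As one hedged illustration of what such monitoring could look like (not a method used in this study), the sketch below flags drift between two snapshots by comparing paired per-case correctness on a fixed benchmark with McNemar's exact test; all names and data here are hypothetical and simulated.

```python
# Illustrative drift check between two snapshots on a fixed benchmark:
# paired per-case correctness compared with McNemar's exact test.
# All names and data are hypothetical, not from the study.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def drift_alarm(old_correct: np.ndarray, new_correct: np.ndarray,
                alpha: float = 0.05) -> bool:
    """Return True if accuracy changed significantly between snapshots."""
    # 2x2 table of (old, new) outcomes over the same cases.
    table = np.zeros((2, 2), dtype=int)
    for o, n in zip(old_correct, new_correct):
        table[int(o), int(n)] += 1
    result = mcnemar(table, exact=True)  # exact test on the discordant pairs
    return result.pvalue < alpha

# Example with simulated outcomes for 287 cases:
rng = np.random.default_rng(0)
march = rng.random(287) < 0.60  # hypothetical March accuracy ~60%
june = rng.random(287) < 0.54   # hypothetical June accuracy ~54%
print("Drift detected:", drift_alarm(march, june))
```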
Disclosures of conflicts of interest: D.L. No relevant relationships. K.G. No relevant relationships. M.B. No relevant relationships. P.S. No relevant relationships. S.B. No relevant relationships. J.C. Member of the Health Canada Scientific Advisory Committee for Digital Health Technologies; chair of the Canadian Association of Radiologists AI Standing Committee.

Author Contributions

Guarantors of integrity of entire study, K.G., J.C.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, D.L., K.G.; clinical studies, K.G.; experimental studies, D.L., K.G., M.B., P.S., J.C.; statistical analysis, D.L., K.G., S.B.; and manuscript editing, D.L., K.G., J.C.

References

1. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology 2023;307(5):e230582.
2. OpenAI. GPT-4 technical report. arXiv 2303.08774 [preprint]. https://arxiv.org/abs/2303.08774. Posted March 15, 2023. Updated March 27, 2023. Accessed September 2023.
3. Mongan J, Moy L, Kahn CE Jr. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell 2020;2(2):e200029.
4. GPT-3.5. OpenAI. https://platform.openai.com/docs/models/gpt-3-5. Accessed August 12, 2023.
5. GPT-4. OpenAI. https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo. Accessed August 12, 2023.
6. Ueda D, Mitsuyama Y, Takita H, et al. ChatGPT's diagnostic performance from patient history and imaging findings on the Diagnosis Please quizzes. Radiology 2023;308(1):e231040.
7. Chen L, Zaharia M, Zou J. How is ChatGPT's behavior changing over time? arXiv 2307.09009 [preprint]. https://arxiv.org/abs/2307.09009. Posted July 18, 2023. Updated October 31, 2023. Accessed September 2023.

Article History

Received September 8, 2023; revision requested October 6, 2023; revision received November 24, 2023; accepted December 4, 2023; published online January 16, 2024.