Original Research: Computer Applications

Accuracy of ChatGPT, Google Bard, and Microsoft Bing for Simplifying Radiology Reports

Kanhai S. Amin, Melissa A. Davis, Rushabh Doshi, Andrew H. Haims, Pavan Khosla, Howard P. Forman

From the Department of Radiology and Biomedical Imaging, Yale School of Medicine, 333 Cedar St, New Haven, CT 06520. Address correspondence to H.P.F. (email: [email protected]).

Radiology, Vol. 309, No. 2. Published online November 21, 2023. https://doi.org/10.1148/radiol.232561

Introduction

With the advent of the Office of the National Coordinator for Health Information Technology's Cures Act Final Rule and its information blocking provision, radiology reports have become increasingly accessible to patients (1). However, patients may not be able to understand their reports due to many factors, including radiology-specific jargon. This can lead to increased patient anxiety and call volume to providers (2). Many solutions, such as providing a lay summary or a second report in lay language, have been proposed (2). Emerging technologies such as large language models (LLMs) powered by natural language processing can generate these additional lay language materials without significantly hindering the radiologist workflow. While these technologies may soon be used on the provider side, patients are already engaging with publicly available LLMs: ChatGPT alone has more than 100 million users (3).

One study (4) demonstrated that four publicly available LLMs (ChatGPT-3.5 [5], GPT-4 [6], Google Bard [7], and Microsoft Bing [8]) can significantly simplify radiology reports. The present work assesses the accuracy of these four LLMs when given the basic prompt "Simplify this radiology report."

Materials and Methods

From the 750 radiology report impressions assessed in a previous article (4), which were gathered from the de-identified, publicly available, and Health Insurance Portability and Accountability Act–compliant MIMIC-IV database (9), we randomly selected 150 impressions (30 each from CT, mammography, MRI, US, and radiography) and their simplified outputs. The average reading grade level (aRGL) was reassessed for this subset of reports by averaging the grade-level scores calculated with the Gunning Fog Index, Flesch-Kincaid Grade Level, Automated Readability Index, and Coleman-Liau Index (4).
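As an illustration of the aRGL calculation described above, the sketch below averages the four named grade-level indices and counts words in Python. The article does not name the library it used for these scores, so the open-source textstat package and the example impression here are assumptions for demonstration only, not the authors' code.

```python
# Hypothetical sketch (not the authors' code): average reading grade level (aRGL)
# from the four indices named in the Methods, plus a simple word count, via textstat.
import textstat

def average_reading_grade_level(text: str) -> float:
    """Average the Gunning Fog, Flesch-Kincaid, ARI, and Coleman-Liau grade levels."""
    scores = [
        textstat.gunning_fog(text),
        textstat.flesch_kincaid_grade(text),
        textstat.automated_readability_index(text),
        textstat.coleman_liau_index(text),
    ]
    return sum(scores) / len(scores)

# Invented impression text, used only to show the calls.
impression = "No acute cardiopulmonary process. Stable mild cardiomegaly without effusion."
print("aRGL:", round(average_reading_grade_level(impression), 1))
print("Word count:", textstat.lexicon_count(impression))
```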
Two radiology attending physicians (M.A.D. and A.H.H., with 9 and 24 years of experience, respectively), blinded to the specific model, compared each LLM-simplified output with the radiologist-dictated impression. The radiologists were asked to rate four statements (Table) on a five-point Likert scale (1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree). The statements were as follows: statement 1, "The simplified version does not contain any inaccurate or misleading information"; statement 2, "The simplified version includes all relevant/actionable information present in the original impression"; statement 3, "The simplified version offers beneficial supplementary information not found in the original impression"; and statement 4, "I feel comfortable giving the simplified output to patients without any supervision." For each model and output, the two radiologists' scores were averaged.

Table 1: Reading Grade Level, Word Count, and Survey Scores for Each Model and Modality

Python version 3.11 (2022) was used to gather readability scores and word counts. R (R Core Team, 2022) was used for data visualization and to conduct Wilcoxon signed-rank tests.
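The signed-rank comparisons reported in the Results were conducted in R, as noted above. Purely to keep a single language across the examples in this piece, the following is a hypothetical Python sketch of the analogous paired test on original versus simplified aRGL values using scipy.stats.wilcoxon; the numbers are invented placeholders, not study data.

```python
# Hypothetical sketch (the study used R): paired Wilcoxon signed-rank test comparing
# the aRGL of original impressions with the aRGL of their LLM-simplified versions.
from scipy.stats import wilcoxon

# Invented aRGL values for five paired impressions (original vs simplified).
original_argl = [11.2, 13.5, 9.8, 12.1, 10.4]
simplified_argl = [8.1, 8.6, 6.4, 7.9, 7.6]

statistic, p_value = wilcoxon(original_argl, simplified_argl)
print(f"Wilcoxon signed-rank statistic = {statistic:.1f}, P = {p_value:.4f}")
```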
Results

All models significantly simplified the impression aRGL across all modalities (P < .0001) (Table). For 86% (129 of 150), 83.3% (125 of 150), 75.3% (113 of 150), and 83.3% (125 of 150) of the simplified outputs from ChatGPT-3.5, GPT-4, Google Bard, and Bing, respectively, both radiologists strongly agreed that the output contained no inaccurate information (statement 1) and all relevant and/or actionable information (statement 2). Furthermore, there were zero, one, two, and zero instances in which the average reviewer score was neutral or worse for statement 1 for ChatGPT-3.5, GPT-4, Google Bard, and Bing, respectively, and zero, zero, two, and zero such instances for statement 2.

Overall, both ChatGPT-3.5 and Bing were significantly more accurate (statement 1) than Bard, while both ChatGPT models and Bing contained the relevant/actionable information (statement 2) significantly more often than Bard (P < .05) (Figure, Table). Bard's output contained the most supplemental information (statement 3) and had the greatest word count, followed by Bing, GPT-4, and ChatGPT-3.5, with each sequential difference in supplementary information and word count statistically significant (P < .01) (Figure, Table). The reviewers felt significantly more comfortable providing output (statement 4) from both ChatGPT models to patients compared with output from Bing and Google Bard (P < .01) (Figure, Table).

Figure: Survey responses for each model and modality. Survey statements were as follows: accurate, the simplified version does not contain any inaccurate or misleading information; relevant, the simplified version includes all relevant/actionable information present in the original impression; supplemental information, the simplified version offers beneficial supplementary information not found in the original impression; and release, I feel comfortable giving the simplified output to patients without any supervision. All reviews were conducted with a five-point Likert scale with whole numbers (1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree); 0.5 values arise from averaging the two reviewers' scores. Bar charts show survey responses for (A) all reports, (B) US, (C) mammography, (D) CT, (E) MRI, and (F) radiography.

Discussion

All models evaluated, particularly ChatGPT-3.5 and Bing, were accurate when simplifying radiology reports with a basic prompt. The outputs of both ChatGPT-3.5 and Bing were at a higher aRGL, which may contribute to their greater accuracy.

Our findings suggest that LLMs may help patients simplify radiologist-dictated impressions. At the same time, the relatively high accuracy and/or relevance of the simplified impressions and their low word count suggest that providers could, with an accuracy check, readily provide simplified output to patients using locally hosted and Health Insurance Portability and Accountability Act–compliant LLMs. However, future workflow studies are required, particularly to ensure that the added value to patients does not come at an onerous cost to radiologists.

Disclosures of conflicts of interest: K.S.A. No relevant relationships. M.A.D. Honorarium for grand rounds at Massachusetts General Hospital; board member, Joint Review Committee on Education in Radiologic Technology. R.D. Patents planned, issued, or pending with Yale School of Medicine. A.H.H. Payment for expert testimony from various law firms for counsel in malpractice cases. P.K. No relevant relationships. H.P.F. Associate editor for Radiology.

Acknowledgment

The authors used large language models to generate the simplified radiology reports.

Author contributions: Guarantors of integrity of entire study, K.S.A., M.A.D., R.D., P.K., H.P.F.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, K.S.A., R.D., A.H.H., P.K.; clinical studies, M.A.D., R.D., A.H.H.; experimental studies, K.S.A., R.D., H.P.F.; statistical analysis, K.S.A., M.A.D., R.D., P.K.; and manuscript editing, K.S.A., M.A.D., R.D., P.K., H.P.F.

References

1. ONC's Cures Act Final Rule. The Office of the National Coordinator for Health Information Technology (ONC). https://www.healthit.gov/topic/oncs-cures-act-final-rule. Accessed September 17, 2023.
2. Amin K, Khosla P, Doshi R, Chheang S, Forman HP. Artificial Intelligence to Improve Patient Understanding of Radiology Reports. Yale J Biol Med 2023;96(3):407–417.
3. Ward E, Gross C. Evolving Methods to Assess Chatbot Performance in Health Sciences Research. JAMA Intern Med 2023;183(9):1030–1031.
4. Doshi R, Amin K, Khosla P, Bajaj S, Chheang S, Forman HP. Utilizing Large Language Models to Simplify Radiology Reports: A Comparative Analysis of ChatGPT3.5, ChatGPT4.0, Google Bard, and Microsoft Bing. medRxiv [preprint] 2023.06.04.23290786. https://doi.org/10.1101/2023.06.04.23290786. Published June 7, 2023. Accessed September 17, 2023.
5. ChatGPT-3.5 (July 20, 2023 version). OpenAI. https://openai.com/blog/chatgpt. Accessed July 23–26, 2023.
6. ChatGPT-4 (July 20, 2023 version). OpenAI. https://openai.com/blog/chatgpt. Accessed July 23–26, 2023.
7. Google Bard (July 13, 2023 version). https://bard.google.com.
8. Microsoft Corporation. Microsoft Bing Chat (July 21, 2023 version). https://www.microsoft.com/en-us/edge/features/bing-chat?form=MT00D8. Accessed July 23–26, 2023.
9. Johnson A, Bulgarelli L, Pollard T, Horng S, Celi LA, Mark R. MIMIC-IV. PhysioNet. https://physionet.org/content/mimiciv/0.4/. Published August 13, 2020. Accessed July 18, 2023.
Article history: Received September 23, 2023; revision requested October 17, 2023; revision received October 25, 2023; accepted October 30, 2023; published online November 21, 2023.