Accuracy of ChatGPT, Google Bard, and Microsoft Bing for Simplifying Radiology Reports

Authors
Kanhai S. Amin, Melissa A. Davis, Rushabh Doshi, Andrew H. Haims, Pavan Khosla, Howard P. Forman
Source
Journal: Radiology [Radiological Society of North America]
Volume/Issue: 309(2); Cited by: 28
Identifier
DOI: 10.1148/radiol.232561
Abstract

Original Research · Computer Applications

Kanhai S. Amin, Melissa A. Davis, Rushabh Doshi, Andrew H. Haims, Pavan Khosla, Howard P. Forman

Author affiliations: Department of Radiology and Biomedical Imaging, Yale School of Medicine, 333 Cedar St, New Haven, CT 06520. Address correspondence to H.P.F. (email: [email protected]).

Published online: November 21, 2023. https://doi.org/10.1148/radiol.232561

Introduction

With the advent of the Office of the National Coordinator for Health Information Technology's Cures Act Final Rule and its information blocking provision, radiology reports have become increasingly accessible to patients (1). However, patients may not be able to understand their reports due to many factors, including radiology-specific jargon. This can lead to increased patient anxiety and call volume to providers (2). Many solutions, such as providing a lay summary or a second report in lay language, have been proposed (2). Emerging technologies such as large language models (LLMs) powered by natural language processing (NLP) can generate these additional lay language materials without significantly hindering the radiologist workflow. While these technologies may soon be used on the provider side, patients are already engaging with publicly available LLMs: ChatGPT alone has more than 100 million users (3).

One study (4) demonstrated that four publicly available LLMs, namely ChatGPT-3.5 (5), GPT-4 (6), Google Bard (7), and Microsoft Bing (8), can significantly simplify radiology reports. The present work assesses the accuracy of the four LLMs when asked the basic prompt "Simplify this radiology report."

Materials and Methods

From the 750 radiology report impressions assessed in a previous article (4), gathered from the de-identified, publicly available, and Health Insurance Portability and Accountability Act (HIPAA)-compliant MIMIC-IV database (9), we randomly selected 150 impressions (30 each from CT, mammography, MRI, US, and radiography) and their simplified outputs. The average reading grade level (aRGL) was reassessed for this subset of reports by averaging the grade-level scores calculated with the Gunning Fog Index, the Flesch-Kincaid Grade Level, the Automated Readability Index, and the Coleman-Liau Index (4).
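As a rough illustration of this aRGL calculation (a minimal sketch, not the authors' code), the four indices can be computed in Python with the third-party textstat package; the sample impression text below is hypothetical:

```python
# Minimal sketch of the aRGL calculation described above, assuming the
# third-party textstat package (pip install textstat); not the authors' code.
import textstat

def average_reading_grade_level(text: str) -> float:
    """Average the four grade-level indices named in the Methods."""
    scores = [
        textstat.gunning_fog(text),
        textstat.flesch_kincaid_grade(text),
        textstat.automated_readability_index(text),
        textstat.coleman_liau_index(text),
    ]
    return sum(scores) / len(scores)

# Hypothetical impression text, for illustration only.
impression = "No acute intracranial hemorrhage. Chronic microvascular ischemic changes are stable."
print(f"aRGL: {average_reading_grade_level(impression):.1f}")
print(f"Word count: {textstat.lexicon_count(impression)}")
```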
Two attending radiologists (M.A.D. and A.H.H., with 9 and 24 years of experience, respectively), blinded to the specific model, compared the LLM-simplified output with the radiologist-dictated impression. The radiologists were asked to rate four statements (Table 1) on a five-point Likert scale (1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree). The statements were as follows: statement 1, "The simplified version does not contain any inaccurate or misleading information"; statement 2, "The simplified version includes all relevant/actionable information present in the original impression"; statement 3, "The simplified version offers beneficial supplementary information not found in the original impression"; and statement 4, "I feel comfortable giving the simplified output to patients without any supervision." For each model and output, the two radiologists' scores were averaged.

Table 1: Reading Grade Level, Word Count, and Survey Scores for Each Model and Modality

Python version 3.11 (2022) was used to gather readability scores and word counts. R (R Core Team, 2022) was used for data visualization and to conduct Wilcoxon signed-rank tests.

Results

All models significantly reduced the impression aRGL across all modalities (P < .0001) (Table 1). Both radiologists strongly agreed that 86% (129 of 150), 83.3% (125 of 150), 75.3% (113 of 150), and 83.3% (125 of 150) of the simplified outputs contained both no inaccurate information (statement 1) and all relevant and/or actionable information (statement 2) for ChatGPT-3.5, GPT-4, Google Bard, and Bing, respectively. Furthermore, there were zero, one, two, and zero instances in which the average reviewer score was neutral or worse for statement 1, and zero, zero, two, and zero such instances for statement 2, for ChatGPT-3.5, GPT-4, Google Bard, and Bing, respectively.

Overall, both ChatGPT-3.5 and Bing were rated significantly more accurate (statement 1) than Bard, while both ChatGPT models and Bing contained the relevant/actionable information (statement 2) significantly more often than Bard (P < .05) (Figure, Table 1). Bard's output contained the most supplemental information (statement 3) and had the greatest word count, followed by Bing, GPT-4, and ChatGPT-3.5, with each sequential difference in supplementary information and word count statistically significant (P < .01) (Figure, Table 1). The reviewers felt significantly more comfortable providing output (statement 4) from both ChatGPT models to patients than providing output from Bing and Google Bard (P < .01) (Figure, Table 1).

Figure: Survey responses for each model and modality. Survey statements were as follows: accurate, the simplified version does not contain any inaccurate or misleading information; relevant, the simplified version includes all relevant/actionable information present in the original impression; supplemental information (info), the simplified version offers beneficial supplementary information not found in the original impression; and release, I feel comfortable giving the simplified output to patients without any supervision. All reviews used a five-point Likert scale with whole numbers (1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree); 0.5 values arise from averaging the two reviewers' scores. Bar charts show survey responses for (A) all reports, (B) US, (C) mammography, (D) CT, (E) MRI, and (F) radiography.
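The P values above come from Wilcoxon signed-rank tests, which the authors ran in R (wilcox.test); a minimal Python equivalent on hypothetical paired aRGL values (an illustration, not the study code) would look like this:

```python
# Minimal sketch of a paired Wilcoxon signed-rank test, assuming scipy;
# the study itself used R. All values below are hypothetical.
from scipy.stats import wilcoxon

original_argl = [11.2, 9.8, 12.5, 10.1, 9.4, 11.8, 10.7, 12.0]
simplified_argl = [6.3, 5.9, 7.1, 6.0, 5.5, 6.8, 6.2, 7.0]

statistic, p_value = wilcoxon(original_argl, simplified_argl)
print(f"Wilcoxon statistic = {statistic}, P = {p_value:.4f}")
```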
Discussion

All models evaluated, particularly ChatGPT-3.5 and Bing, were accurate when simplifying radiology reports with a basic prompt. The outputs of both ChatGPT-3.5 and Bing were at a higher aRGL, which may contribute to their greater accuracy.

Our findings suggest LLMs may help patients simplify radiologist-dictated impressions. At the same time, the relatively high accuracy and relevance of the simplified impressions and their low word counts suggest that providers could readily give simplified output to patients, after an accuracy check, within locally hosted and HIPAA-compliant LLMs. However, future workflow studies are required, particularly to ensure that the added value to patients does not come at an onerous cost to radiologists.

Disclosures of conflicts of interest: K.S.A. No relevant relationships. M.A.D. Honorarium for grand rounds at Massachusetts General Hospital; board member, Joint Review Committee on Education in Radiologic Technology. R.D. Patents planned, issued, or pending with Yale School of Medicine. A.H.H. Payment for expert testimony from various law firms for counsel in malpractice cases. P.K. No relevant relationships. H.P.F. Associate editor for Radiology.

Acknowledgment

The authors used large language models to generate the simplified radiology reports.

Author contributions: Guarantors of integrity of entire study, K.S.A., M.A.D., R.D., P.K., H.P.F.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, K.S.A., R.D., A.H.H., P.K.; clinical studies, M.A.D., R.D., A.H.H.; experimental studies, K.S.A., R.D., H.P.F.; statistical analysis, K.S.A., M.A.D., R.D., P.K.; and manuscript editing, K.S.A., M.A.D., R.D., P.K., H.P.F.

References

1. ONC's Cures Act Final Rule. The Office of the National Coordinator for Health Information Technology (ONC). https://www.healthit.gov/topic/oncs-cures-act-final-rule. Accessed September 17, 2023.
2. Amin K, Khosla P, Doshi R, Chheang S, Forman HP. Artificial Intelligence to Improve Patient Understanding of Radiology Reports. Yale J Biol Med 2023;96(3):407–417.
3. Ward E, Gross C. Evolving Methods to Assess Chatbot Performance in Health Sciences Research. JAMA Intern Med 2023;183(9):1030–1031.
4. Doshi R, Amin K, Khosla P, Bajaj S, Chheang S, Forman HP. Utilizing Large Language Models to Simplify Radiology Reports: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Bard, and Microsoft Bing. medRxiv [preprint] 2023.06.04.23290786. https://doi.org/10.1101/2023.06.04.23290786. Published June 7, 2023. Accessed September 17, 2023.
5. ChatGPT-3.5 (July 20, 2023 version). OpenAI. https://openai.com/blog/chatgpt. Accessed July 23–26, 2023.
6. ChatGPT-4 (July 20, 2023 version). OpenAI. https://openai.com/blog/chatgpt. Accessed July 23–26, 2023.
7. Google Bard (July 13, 2023 version). https://bard.google.com.
8. Microsoft Corporation. Microsoft Bing Chat (July 21, 2023 version). https://www.microsoft.com/en-us/edge/features/bing-chat?form=MT00D8. Accessed July 23–26, 2023.
9. Johnson A, Bulgarelli L, Pollard T, Horng S, Celi LA, Mark R. MIMIC-IV. PhysioNet. https://physionet.org/content/mimiciv/0.4/. Published August 13, 2020. Accessed July 18, 2023.
Article history: Received September 23, 2023; revision requested October 17, 2023; revision received October 25, 2023; accepted October 30, 2023; published online November 21, 2023.