Seeing the Unseen: Advancing Generative AI Research in Radiology

Subjects: Medicine; Radiology; Medical Physics; Artificial Intelligence; Computer Science
Author: Woojin Kim
Source: Radiology (Radiological Society of North America), Vol. 311, No. 2
DOI: 10.1148/radiol.240935

Editorial. From Rad AI, San Francisco, Calif; and the Department of Radiology, Palo Alto VA Medical Center, 3801 Miranda Ave, Palo Alto, CA 94304. Address correspondence to the author (email: [email protected]). Published online May 21, 2024. https://doi.org/10.1148/radiol.240935

Introduction

It seems obvious now, but I can still recall my fascination when I first learned about the Maisonneuve fracture. Every radiologist is taught that when you see a fracture of the medial malleolus with widening of the distal tibiofibular syndesmosis on ankle radiographs, you should look higher for the proximal fibular fracture that may be present beyond the edge of the image. It is a lesson in how understanding the mechanism of injury and the pathophysiology expands our vision. As the world, including radiology, continues to be captivated by generative artificial intelligence (AI), we should aim to examine beyond what is visible and see the unseen for more effective AI research in radiology (Table).

Table: Key Points for Advancing Generative AI Research in Radiology.

On Evaluations

In March 2023, OpenAI published a remarkable list of academic and professional examinations aced by GPT-4 (1). Other articles subsequently made headlines by demonstrating GPT-4's ability to pass the U.S. Medical Licensing Examination, or USMLE (2), and a radiology board–style examination (3). While the performance may appear impressive, testing large language models (LLMs) as if they were people can lead to misleading results and misinterpretation of their capabilities. Although academic examinations may not always accurately measure students' abilities, we often assume that those who perform well possess particular understanding, knowledge, and problem-solving skills. However, determining whether a high score from an LLM indicates genuine comprehension, results from statistical correlations, or stems from mere memorization can be challenging. Not knowing GPT-4's training data, we cannot dismiss the possibility of data contamination, that is, the chance that the model encountered the examination questions during its training.

In addition, when someone does well on an examination, you expect that person to perform well on similar examinations. This is not necessarily the case with LLMs. LLMs are brittle: a slight change in the question or in the ordering of the multiple-choice options (4) can change how they answer. To demonstrate this fragility, I took one of the examples from the Radiology article by Bhayana et al (3). I changed one word, from "What is the absolute washout for this lesion?" to "What is the relative washout for this lesion?", and GPT-4 used the wrong washout formula (the two standard washout definitions are sketched below for reference). Subsequently, when asked about these concepts individually, GPT-4 mixed up the formulas and descriptions. Indeed, Microsoft researchers showed that LLMs performed worse when original evaluation problems were modified into new ones through paraphrasing or additional context (5), highlighting the need for more dynamic evaluation methods (6). What's more, LLMs trained on "A is B" are often blind to the inverse "B is A" (ie, the "reversal curse") (7). The stochastic nature of these models also means that repeated inquiries can result in different responses.
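For reference, the two formulas at issue are the standard adrenal CT washout calculations; they are given here only as a reminder of what the one-word change switches between, not as a quotation from the cited examination question. HU denotes attenuation in Hounsfield units on the unenhanced, enhanced (portal venous), and delayed phases.

\[
\text{Absolute washout (\%)} = \frac{HU_{\text{enhanced}} - HU_{\text{delayed}}}{HU_{\text{enhanced}} - HU_{\text{unenhanced}}} \times 100
\]

\[
\text{Relative washout (\%)} = \frac{HU_{\text{enhanced}} - HU_{\text{delayed}}}{HU_{\text{enhanced}}} \times 100
\]

The two differ only in the denominator (relative washout is used when no unenhanced acquisition is available), which is exactly why swapping a single word in the question probes whether the model has the distinction right rather than a memorized answer.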
So, does GPT-4 genuinely know and understand these concepts? There is heated debate over understanding in LLMs, including the very definition of what it means to "understand." Regardless of where you stand on this debate, we need to look beyond the headline-grabbing test results and reconsider how we evaluate and promote LLMs. Does it even make sense to test LLMs the same way we test people? Narayanan and Kapoor (8) argue that "human benchmarks are meaningless for bots." Bhayana (9) acknowledged that licensing examination performance is not a proxy for safe and effective care. The unseen is how LLMs pass these examinations. Instead of focusing on the test results, we should focus on the "how" and the "why" of LLM performance (10). By acknowledging the differences between LLMs and humans and by using evaluation metrics designed to assess real-world tasks, we can better assess their capabilities and limitations. While I anticipate publications of similar testing of GPT-5, Google Gemini 2.0, and so on, I hope the research community will shift its focus to figuring out what is happening under the hood and to evaluating these models in ways that reflect the actual practice of medicine. After all, the last time I looked, none of the images on my picture archiving and communication system came with multiple-choice options.

On Closed Models

One of the fundamental issues with researching models such as GPT-4 is that they are "closed models," meaning we do not know their architecture, training data, and so on. Performing research and writing manuscripts along the lines of "I asked ChatGPT x, and it responded with y" teeters perilously close to penning a mere product review. While reviews of commercial solutions have a place in the radiology literature, one must be careful when attributing scientific merit. Why? Because we have no idea what these models are trained on or how. For example, you can find the question I used from the article by Bhayana et al (3) online. So, did the model genuinely figure out the answer, or did it regurgitate memorized content? We may never know, but research has demonstrated the data contamination issue with these LLMs (11). Moreover, LLMs change over time, and this "LLM drift" (12) can happen silently, unbeknownst to the researchers studying them. The LLMs may also modify our prompts and their outputs. While this practice may serve as a guardrail against misuse, it can also have undesirable outcomes (13). Companies can justifiably guard the secrets of their creations. Still, we should be careful in assigning scientific credibility to something we cannot examine (14).

On Emergence

Emergence is another alluring but potentially misleading notion in LLMs. It is a concept popularized by the 1972 essay "More Is Different" by Nobel laureate Philip W. Anderson (15). In LLMs, emergence describes new abilities or behaviors that were not explicitly programmed into the model and are not seen in smaller models (16). However, Schaeffer et al (17) showed that the so-called emergent abilities of LLMs are often due to the researcher's choice of metric rather than a leap in the model's intrinsic capabilities arising from complexity and scale. While emergence may be possible with a mixture of models, multimodality, and multiple agents (18), overreliance on this concept risks oversimplifying the complex inner workings of these models and discouraging us from investigating further (19).

On Synthetic Data

Collecting data in the medical domain is time-consuming and expensive.
Using generative AI to create synthetic data has the potential benefits of addressing privacy concerns and augmenting data diversity and quality within the medical domain. Using generative adversarial networks and diffusion models, several groups have demonstrated the ability to create synthetic radiographs, CT images, and MRI scans (20–22). While there are many potential benefits of synthetic data, this is another area that warrants caution. What's visible is the easy part, as we can now create synthetic lesions that could convince any radiologist. What's challenging is what we cannot see. Gichoya et al (23) showed that deep learning models can be trained to predict race from medical images with high performance, yet the mechanism of such detection remains elusive. So, what else are the AI models synthesizing besides the lesion you are augmenting in the training data set? What hidden biases are you unwittingly perpetuating? Finally, relying too heavily on synthetic data can backfire, as synthetic data often lack the richness and complexity of real data and can lead to "model collapse," in which model performance worsens over time (24).

On Clinical Domain Expertise

When working on a radiology solution, researchers and vendors must have radiologists on their team to ensure the technology is safe and effective (25). The study by Thawkar et al (26) is an egregious example of the pitfalls of omitting clinical domain expertise. While a name like "XrayGPT" may gain attention, if you look closely at the chatbot interface (26,27), you will notice that the images and the accompanying text bear no relation to each other. I have observed similar missteps with some of the early deep learning computer vision applications in radiology, and history appears to be repeating itself with foundation models. As a radiology resident, I often heard, "One view is no view." Similarly, when generating radiology reports for most examination types, one image is no image. The article by Yang et al (28) claimed the effectiveness of GPT-4V in "medical image understanding," yet the researchers evaluated the application of GPT-4V to radiology report generation using only a single image. Moreover, their "accurate" examples were fraught with errors and issues. The gravity of our work, with its potential to influence life-or-death outcomes for patients, necessitates incorporating clinical domain expertise. Engaging radiologists not as an afterthought but in an iterative process from inception to deployment is not optional but essential.

On AI Adoption

A famous figure from an article by Google researchers showed that only a tiny portion of a real-world machine learning system is composed of machine learning code, highlighting the complexity of the surrounding infrastructure (29). Similarly, clinical adoption of AI is complicated, and we need to look beyond the clinical accuracy of models that many research studies focus on. In the process, we must be careful not to equate clinical accuracy with clinical efficiency (30). I would further argue that neither clinical accuracy nor efficiency necessarily translates to clinical utility. If an AI solution enables a radiologist to read 200 cases a day instead of 100, that enhanced "efficiency" is impractical, as it will burn out the radiologist and exacerbate the staffing shortage that is plaguing the world today, unless it can be achieved with the effort of reading, say, 50 cases. Jha (31), in what he describes as radiology's AI adoption dilemma, further explores the complexity around efficiency and productivity.
We should harness the potential of AI by maximizing human-AI symbiosis through collaborative workflows in which human cognition and AI complement each other (32), ultimately enabling the return of the radiologist as the "doctor's doctor" (33). One illustrative use case of generative AI in radiology is automatic impression generation that balances increased efficiency with reduced cognitive load. What is often overlooked in academic discussion yet observed from years of commercial use (Omni Impressions, Rad AI) is the importance of personalizing these automated impressions. This personal touch, appealing to radiologists' preferences for their own narrative style, is beginning to be appreciated and studied (34). It is important to recognize that personalization matters in clinical AI adoption, where a one-size-fits-all approach often does not work (35).

Final Thoughts

In the parable of the blind men and an elephant, a group of blind men who had never encountered an elephant before attempted to learn and imagine what an elephant might be like by touching it. With curiosity, each blind man touched a different part of the animal's body, but only one part, such as the trunk, tusk, or tail. They each described the animal based on their limited experience and arrived at wildly divergent conclusions. In some versions of the story, their disagreements escalated to physical altercations. The moral of the parable is that people often claim absolute truth based on their limited and subjective experiences while ignoring others' equally valid and limited vantage points (36). When I read articles and listen to others on generative AI, I am reminded of this parable as a metaphor for navigating the uncharted terrain of generative AI, while also acknowledging my potential role as one of the proverbial blind men. Hence, it is also a reminder to approach this field with humility and caution. Generative AI's potential impact on health care is significant and transformational. It will be important to keep an open mind, appreciate others' perspectives, and continue our collective exploration to see the unseen.

Disclosures of conflicts of interest: W.K. Consulting fees from ClariPi, Hyperfine Research, Infiniti Medical, and Nuance Communications; honoraria from the Radiology Business Management Association and University of Pennsylvania; support for attending meetings or travel from Equium Intelligence and Rad AI; patents planned, issued, or pending with Equium Intelligence and Rad AI; participation on a data safety monitoring board or advisory board for Alara Imaging, Braid Health, ImageBiopsy Lab, Inference Analytics, Luxsonic Technologies, Rad AI, and Within Health; board member for the Society for Imaging Informatics in Medicine; editorial board member for the Journal of Imaging Informatics in Medicine; stock or stock options in Equium Intelligence, Nuance Communications, and Rad AI.

References
1. Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report. OpenAI. arXiv 2303.08774 [preprint]. https://arxiv.org/abs/2303.08774. Posted March 15, 2023. Accessed March 27, 2024.
2. Nori H, King N, McKinney SM, et al. Capabilities of GPT-4 on medical challenge problems. arXiv 2303.13375 [preprint]. https://arxiv.org/abs/2303.13375. Posted March 20, 2023. Accessed March 27, 2024.
3. Bhayana R, Bleakney RR, Krishna S. GPT-4 in radiology: improvements in advanced reasoning. Radiology 2023;307(5):e230987.
4. Pezeshkpour P, Hruschka E. Large language models sensitivity to the order of options in multiple-choice questions. arXiv 2308.11483 [preprint]. https://arxiv.org/abs/2308.11483. Posted August 22, 2023. Accessed March 27, 2024.
5. Zhu K, Wang J, Zhao Q, et al. DyVal 2: dynamic evaluation of large language models by meta probing agents. arXiv 2402.14865 [preprint]. https://arxiv.org/abs/2402.14865. Posted February 21, 2024. Accessed March 27, 2024.
6. Wang S, Long Z, Fan Z, et al. Benchmark self-evolving: a multi-agent framework for dynamic LLM evaluation. arXiv 2402.11443 [preprint]. https://arxiv.org/abs/2402.11443. Posted February 18, 2024. Accessed March 27, 2024.
7. Berglund L, Tong M, Kaufmann M, et al. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv 2309.12288 [preprint]. https://arxiv.org/abs/2309.12288. Posted September 21, 2023. Accessed March 27, 2024.
8. Narayanan A, Kapoor S. GPT-4 and professional benchmarks: the wrong answer to the wrong question. AI Snake Oil. https://www.aisnakeoil.com/p/gpt-4-and-professional-benchmarks. Published March 20, 2023. Accessed March 27, 2024.
9. Bhayana R. Chatbots and large language models in radiology: a practical primer for clinical and research applications. Radiology 2024;310(1):e232756.
10. Heaven WD. AI hype is built on high test scores. Those tests are flawed. MIT Technology Review. https://www.technologyreview.com/2023/08/30/1078670/large-language-models-arent-people-lets-stop-testing-them-like-they-were. Published August 30, 2023. Accessed March 27, 2024.
11. Golchin S, Surdeanu M. Time travel in LLMs: tracing data contamination in large language models. arXiv 2308.08493 [preprint]. https://arxiv.org/abs/2308.08493. Posted August 16, 2023. Accessed March 27, 2024.
12. Chen L, Zaharia M, Zou J. How is ChatGPT's behavior changing over time? arXiv 2307.09009 [preprint]. https://arxiv.org/abs/2307.09009. Posted July 18, 2023. Accessed March 27, 2024.
13. Is Google's Gemini chatbot woke by accident, or by design? The Economist. https://www.economist.com/united-states/2024/02/28/is-googles-gemini-chatbot-woke-by-accident-or-design. Published February 28, 2024. Accessed March 27, 2024.
14. Rogers A. Closed AI models make bad baselines. Towards Data Science. https://towardsdatascience.com/closed-ai-models-make-bad-baselines-4bf6e47c9e6a. Published April 24, 2023. Accessed March 27, 2024.
15. Anderson PW. More is different. Science 1972;177(4047):393–396.
16. Wei J, Tay Y, Bommasani R, et al. Emergent abilities of large language models. arXiv 2206.07682 [preprint]. https://arxiv.org/abs/2206.07682. Posted June 15, 2022. Accessed March 27, 2024.
17. Schaeffer R, Miranda B, Koyejo S. Are emergent abilities of large language models a mirage? arXiv 2304.15004 [preprint]. https://arxiv.org/abs/2304.15004. Posted April 28, 2023. Accessed March 27, 2024.
18. Lungren MP, Fishman EK, Chu LC, Rizk RC, Rowe SP. More is different: large language models in health care. J Am Coll Radiol 2023;S1546-1440(23)00962-6.
19. "Emergence" isn't an explanation, it's a prayer. From Narrow To General AI. https://ykulbashian.medium.com/emergence-isnt-an-explanation-it-s-a-prayer-ef239d3687bf. Published July 15, 2023. Accessed March 27, 2024.
20. Chambon P, Bluethgen C, Delbrouck JB, et al. RoentGen: vision-language foundation model for chest x-ray generation. arXiv 2211.12737 [preprint]. https://arxiv.org/abs/2211.12737. Posted November 23, 2022. Accessed March 27, 2024.
21. Pan S, Wang T, Qiu RLJ, et al. 2D medical image synthesis using transformer-based denoising diffusion probabilistic model. Phys Med Biol 2023;68(10):105004.
22. Rouzrokh P, Khosravi B, Faghani S, et al. Multitask brain tumor inpainting with diffusion models: a methodological report. arXiv 2210.12113 [preprint]. https://arxiv.org/abs/2210.12113. Posted October 21, 2022. Accessed March 27, 2024.
23. Gichoya JW, Banerjee I, Bhimireddy AR, et al. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit Health 2022;4(6):e406–e414.
24. Shumailov I, Shumaylov Z, Zhao Y, et al. The curse of recursion: training on generated data makes models forget. arXiv 2305.17493 [preprint]. https://arxiv.org/abs/2305.17493. Posted May 27, 2023. Accessed March 27, 2024.
25. Yildirim N, Richardson H, Wetscherek MT, et al. Multimodal healthcare AI: identifying and designing clinically relevant vision-language applications for radiology. arXiv 2402.14252 [preprint]. https://arxiv.org/abs/2402.14252. Posted February 22, 2024. Accessed March 27, 2024.
26. Thawkar O, Shaker A, Mullappilly SS, et al. XrayGPT: chest radiographs summarization using large medical vision-language models. arXiv 2306.07971 [preprint]. https://arxiv.org/abs/2306.07971. Posted June 13, 2023. Accessed March 27, 2024.
27. Shaker A. XrayGPT: chest radiographs summarization using medical vision-language models. https://www.youtube.com/watch?v=-zzq7bzbUuY. Published May 19, 2023. Accessed March 27, 2024.
28. Yang Z, Li L, Wang J, et al. The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv 2309.17421 [preprint]. https://arxiv.org/abs/2309.17421. Posted September 29, 2023. Accessed March 27, 2024.
29. Sculley D, Holt G, Golovin D, et al. Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems 28 (NIPS 2015). https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf. Published 2015. Accessed March 27, 2024.
30. Jongsma KR, Sand M, Milota M. Why we should not mistake accuracy of medical AI for efficiency. NPJ Digit Med 2024;7(1):57.
31. Jha S. Algorithms at the gate—radiology's AI adoption dilemma. JAMA 2023;330(17):1615–1616.
32. Gefter WB, Prokop M, Seo JB, Raoof S, Langlotz CP, Hatabu H. Human-AI symbiosis: a path forward to improve chest radiography and the role of radiologists in patient care. Radiology 2024;310(1):e232778.
33. Chang PJ. Imaging informatics: maturing beyond adolescence to enable the return of the doctor's doctor. Radiology 2023;309(1):e230936.
34. Tie X, Shin M, Pirasteh A, et al. Personalized impression generation for PET reports using large language models. J Imaging Inform Med 2024.
35. Yu F, Moehring A, Banerjee O, Salz T, Agarwal N, Rajpurkar P. Heterogeneity and predictors of the effects of AI assistance on radiologists. Nat Med 2024;30(3):837–849.
36. Blind men and an elephant. Wikipedia. https://en.wikipedia.org/wiki/Blind_men_and_an_elephant. Accessed March 27, 2024.

Article history: Received March 28, 2024; revision requested April 2, 2024; revision received April 8, 2024; accepted April 16, 2024; published online May 21, 2024.