
The application of multimodal large language models in medicine

Authors: Jianing Qiu, Wu Yuan, Kyle Lam
Journal: The Lancet Regional Health - Western Pacific [Elsevier BV]
Volume 45: 101048. Cited by: 9
DOI: 10.1016/j.lanwpc.2024.101048
Abstract

In September 2023, OpenAI released GPT-4V,[1] a multimodal foundation model[2,3] connecting large language models (LLMs) with vision input. Foundation models, defined as large AI models trained on vast datasets that can later be adapted to a range of downstream tasks,[4] represent the latest wave in AI research. Unlike task-specific AI models, which are trained for a single function, foundation models are designed to be multi-purpose. The most widely known of these are the GPT models powering ChatGPT, which previously accepted language input alone. Language represents only a proportion of the data encountered within healthcare, and this limitation of 'unimodal' AI, including LLMs, has meant that vital data sources such as radiology, endoscopic images, and laboratory investigations could not be incorporated. However, OpenAI's latest offering, GPT-4V, allows image input; in conjunction with Whisper,[5] an automatic speech recognition system, and text-to-speech generation techniques, ChatGPT can now see, hear, and speak.[6] ChatGPT's leap into multimodality therefore opens new horizons for clinical work processes and applications.

Here, we highlight four example areas where multimodal LLMs can benefit clinicians, using an example scenario of a patient presenting with small bowel obstruction (Fig. 1) and across varying specialties (Supplement). First, multimodal LLMs empower LLMs further through seamless transcription and summarisation of speech data, allowing generation of clinical records or letters directly from the doctor-patient consultation (Fig. 1); this could significantly reduce the burden of clinical documentation. Secondly, multimodal LLMs build upon existing AI image interpretation through their ability to integrate existing information, including the patient's history, the indications for imaging, and comparisons with previous imaging, and by offering recommendations (Fig. 1). They can reduce the need for large datasets through their few-shot or zero-shot learning abilities (completing a task with limited or no training examples, respectively) and support visual prompting to refine the prediction (for example, a user can manually indicate the region of interest within an image).[7,8] Thirdly, optical character recognition empowers multimodal LLMs to detect numbers and text (irrespective of the language used) within image input (Fig. 1). Finally, capabilities in video understanding could allow automatic documentation of procedural notes, improving the efficiency and accuracy of documentation. Scene understanding, for example the identification of anatomical landmarks, could open doors to procedure assistance, augment clinician capabilities, and ultimately lead to improved clinical outcomes.
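To make these capabilities concrete, below is a minimal sketch, assuming the OpenAI Python SDK (v1.x), of how a "hear and see" pipeline of the kind described above might be wired together: a consultation recording is transcribed with Whisper, summarised into a draft note, and an accompanying radiograph is queried with clinical context. The file names, prompts, and model identifiers are illustrative assumptions rather than the authors' implementation, and any output would require clinician verification.

```python
# Minimal sketch (not the authors' implementation) of a "hear and see"
# workflow with the OpenAI Python SDK. File names, prompts, and model
# identifiers are illustrative assumptions and may differ from current APIs.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. "Hear": transcribe a recorded doctor-patient consultation with Whisper.
with open("consultation.mp3", "rb") as audio_file:  # hypothetical recording
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Summarise the transcript into a draft clinical note.
note = client.chat.completions.create(
    model="gpt-4o",  # placeholder for any capable chat model
    messages=[
        {"role": "system",
         "content": "Summarise this consultation as a structured clinical note."},
        {"role": "user", "content": transcript.text},
    ],
)
print(note.choices[0].message.content)

# 3. "See": query a vision-capable model about an abdominal radiograph,
#    supplying the clinical context alongside the image.
with open("abdominal_xray.png", "rb") as image_file:  # hypothetical image
    image_b64 = base64.b64encode(image_file.read()).decode("utf-8")

reading = client.chat.completions.create(
    model="gpt-4o",  # GPT-4V was exposed as "gpt-4-vision-preview" at the time
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Patient with suspected small bowel obstruction. "
                         "Describe the salient findings on this radiograph."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(reading.choices[0].message.content)
```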
These newfound capabilities of multimodal LLMs also pose newfound challenges for their adoption within healthcare. Hallucinations, where the model outputs incorrect or nonsensical information, are a fundamental issue. In our example, the multimodal ChatGPT outputs an incorrect interpretation of an ECG which is convincing at face value (Fig. 1). Exploratory studies[1,8] have also shown that GPT-4V can hallucinate while responding to vision-based queries, secondary to either incorrect reasoning from the underlying LLM or incorrect recognition of visual content. The input of an increasing number of clinical data types is a concern, as broader expertise will be required to determine ground truth, making it harder to identify the source of hallucinations. Reliability must therefore be improved in order to meet the high threshold required for translation into clinical practice.

Secondly, increasing data modalities will lead to greater privacy concerns. The growing size of foundation models raises the risk of accidental exposure of patient data used during training. Data modalities such as speech and video threaten the privacy not only of patients but also of clinicians themselves.

Finally, regulation of multimodal LLMs presents a significant challenge. While task-specific AI requires validation only for the task it is designed for, the emergent intelligence of foundation models (where future capabilities of the model are still to be discovered) demands a rethink by regulators of the approach taken both to test models and to mitigate against AI failure. It is likely that foundation models will not fit neatly into existing regulation and will require novel, custom solutions. This should be a key priority for translation, as innovation is likely to outpace regulation. One potential approach is to adapt existing regulation for anticipated downstream applications of the foundation model and to monitor for emerging functions, risks, and failures. However, multimodal LLMs are trained on huge amounts of data taken from the internet, and it is difficult for users to know what data were used to train them. This calls into question the traditional approach of validating models on public benchmarks, as the data used to train the multimodal LLMs may have included these benchmarks. Thus, regulators need to establish isolated validation datasets which are inaccessible to model developers, and conduct independent examinations using such datasets to ensure trustworthy and objective validation.
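As a toy illustration of this independent-validation idea, the sketch below, assuming a hypothetical `query_model` callable and a regulator-held JSON file of question-answer pairs, scores a model against data that never leave the regulator's environment. It sketches the concept only; it is not an existing regulatory tool, and real clinical benchmarks would need task-specific metrics and expert adjudication.

```python
# Hypothetical, minimal sketch of independent validation on an isolated
# dataset: ground-truth answers stay in the regulator's environment, and
# only an aggregate score is reported back to the model developer.
import json
from typing import Callable

def evaluate_on_holdout(query_model: Callable[[str], str],
                        holdout_path: str) -> float:
    """Score a model on a held-out benchmark of (prompt, expected) pairs."""
    with open(holdout_path) as f:
        cases = json.load(f)  # e.g. [{"prompt": "...", "expected": "..."}, ...]

    correct = 0
    for case in cases:
        answer = query_model(case["prompt"])
        # Exact-match scoring is a placeholder; clinical tasks would need
        # richer, task-specific evaluation.
        if answer.strip().lower() == case["expected"].strip().lower():
            correct += 1
    return correct / len(cases)

if __name__ == "__main__":
    # Stub model and hypothetical dataset path, for illustration only.
    accuracy = evaluate_on_holdout(lambda prompt: "stub answer",
                                   "regulator_holdout.json")
    print(f"Held-out accuracy: {accuracy:.2%}")
```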
Despite these challenges, multimodal AI powered by foundation models offers significant promise in augmenting the medical workforce in clinical decision-making and management. The release of GPT-4V will spark future endeavours in the responsible development, use, and regulation of multimodal medical AI, and in the improvement of AI trustworthiness and accessibility in medicine.

The authors declare no competing interests. Funding: funding and infrastructural support was provided by the NIHR Imperial Biomedical Research Centre. Kyle Lam is supported by an NIHR Academic Clinical Fellowship.

References
1. OpenAI. GPT-4V(ision) system card. 2023. https://cdn.openai.com/papers/GPTV_System_Card.pdf (accessed October 18, 2023).
2. Qiu J, Li L, Sun J, et al. Large AI models in health informatics: applications, challenges, and the future. IEEE J Biomed Health Inform. 2023: 1-14.
3. Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023; 616: 259-265.
4. Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and risks of foundation models. arXiv. 2021. https://doi.org/10.48550/arXiv.2108.07258
5. Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I. Robust speech recognition via large-scale weak supervision. In: Proceedings of the 40th International Conference on Machine Learning. PMLR, 2023: 28492-28518.
6. OpenAI. ChatGPT can now see, hear, and speak. 2023. https://openai.com/blog/chatgpt-can-now-see-hear-and-speak
7. Kirillov A, Mintun E, Ravi N, et al. Segment anything. arXiv. 2023. https://doi.org/10.48550/arXiv.2304.02643
8. Yang Z, Li L, Lin K, et al. The dawn of LMMs: preliminary explorations with GPT-4V(ision). 2023. https://ui.adsabs.harvard.edu/abs/2023arXiv230917421Y (accessed October 18, 2023).