Benchmarking Large Language Models in Evidence-Based Medicine

Benchmarking; Computer science; Natural language processing; Data science; Artificial intelligence; Business; Marketing
Authors
Jin Li,Yiyan Deng,Qi Sun,Junjie Zhu,Yu Tian,Jingsong Li,Tingting Zhu
Source
Journal: IEEE Journal of Biomedical and Health Informatics [Institute of Electrical and Electronics Engineers]
Volume/Issue:: 1-14 Cited by: 2
Identifier
DOI:10.1109/jbhi.2024.3483816
Abstract

Evidence-based medicine (EBM) represents a paradigm of providing patient care grounded in the most current and rigorously evaluated research. Recent advances in large language models (LLMs) offer a potential way to transform EBM by automating labor-intensive tasks and thereby improving the efficiency of clinical decision-making. This study explores integrating LLMs into the key stages of EBM, evaluating their ability across evidence retrieval (PICO extraction, biomedical question answering), synthesis (summarizing randomized controlled trials), and dissemination (medical text simplification). We conducted a comparative analysis of seven LLMs, including both proprietary and open-source models, as well as those fine-tuned on medical corpora. Specifically, we benchmarked the performance of various LLMs on each EBM task under zero-shot settings as baselines, and employed prompting techniques, including in-context learning, chain-of-thought reasoning, and knowledge-guided prompting, to enhance their capabilities. Our extensive experiments revealed the strengths of LLMs, such as remarkable understanding capabilities even in zero-shot settings, strong summarization skills, and effective knowledge transfer via prompting. Prompting strategies such as knowledge-guided prompting proved highly effective (e.g., improving the performance of GPT-4 by 13.10% over zero-shot in PICO extraction). However, the experiments also showed limitations, with LLM performance falling well below state-of-the-art baselines like PubMedBERT in handling named entity recognition tasks. Moreover, human evaluation revealed persisting challenges with factual inconsistencies and domain inaccuracies, underscoring the need for rigorous quality control before clinical application. This study provides insights into enhancing EBM using LLMs while highlighting critical areas for further research. The code is publicly available on GitHub.
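
The prompting settings named in the abstract (zero-shot, in-context learning, chain-of-thought, knowledge-guided) can be illustrated with a minimal Python sketch for the PICO extraction task. This is not the authors' code: the instruction wording, the field definitions, and the names `build_prompt`, `Demo`, `INSTRUCTION`, and `DEFINITIONS` are all assumptions made for illustration, and the paper's actual prompts may differ.

```python
from dataclasses import dataclass

# Hypothetical field names and wording; the abstract does not show the real prompts.
PICO_FIELDS = ("Population", "Intervention", "Comparison", "Outcome")

INSTRUCTION = "Extract the " + ", ".join(PICO_FIELDS) + " elements from the sentence."

# Assumed background knowledge used only by the knowledge-guided variant.
DEFINITIONS = {
    "Population": "the patients or participants studied",
    "Intervention": "the treatment or exposure being tested",
    "Comparison": "the control or alternative treatment",
    "Outcome": "the measured result or endpoint",
}


@dataclass(frozen=True)
class Demo:
    """One worked example for in-context (few-shot) learning."""
    sentence: str
    answer: str


def build_prompt(sentence, demos=(), chain_of_thought=False, knowledge=False):
    """Assemble a prompt string; with all options off this is the zero-shot baseline."""
    parts = []
    if knowledge:
        # Knowledge-guided: prepend the field definitions before the instruction.
        parts.append("\n".join(f"{k}: {v}" for k, v in DEFINITIONS.items()))
    parts.append(INSTRUCTION)
    for d in demos:
        # In-context learning: show solved examples before the query.
        parts.append(f"Sentence: {d.sentence}\nAnswer: {d.answer}")
    # Chain-of-thought: ask for intermediate reasoning instead of a direct answer.
    cue = "Let's think step by step." if chain_of_thought else "Answer:"
    parts.append(f"Sentence: {sentence}\n{cue}")
    return "\n\n".join(parts)
```

Calling `build_prompt(text)` gives the zero-shot baseline, while passing `demos`, `chain_of_thought=True`, or `knowledge=True` switches on the corresponding technique, so the same input can be sent to a model under each setting and the outputs compared.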