BioInstruct: instruction tuning of large language models for biomedical natural language processing

Keywords: Computer science · Biomedical text mining · Natural language processing · Artificial intelligence · Language models · Text mining
Authors
Hieu Tran,Zhichao Yang,Zonghai Yao,Hong Yu
Source
Journal: Journal of the American Medical Informatics Association [Oxford University Press]
Volume/Issue: 31 (9): 1821-1832 · Cited by: 10
Identifier
DOI:10.1093/jamia/ocae122
Abstract

Objectives: To enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles.

Materials and Methods: We created BioInstruct, a dataset comprising 25 005 instructions for instruction-tuning LLMs (LLaMA 1 and 2, 7B and 13B versions). The instructions were created by prompting the GPT-4 language model with 3 seed samples randomly drawn from 80 human-curated instructions. We employed Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into 3 major categories: question answering (QA), information extraction (IE), and text generation (GEN). We also examined whether the categories (eg, QA, IE, and generation) of instructions impact model performance.

Results and Discussion: Compared with LLMs without instruction tuning, our instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA on the average accuracy metric, 5.7% in IE on the average F1 metric, and 96% in generation tasks on the average GPT-4 score metric. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive with or even surpassed other LLMs in the biomedical domain that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. Our findings align with the observations of multi-task learning, suggesting synergies between the 2 tasks.

Conclusion: The BioInstruct dataset serves as a valuable resource, and instruction-tuned LLMs lead to the best-performing BioNLP applications.
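The instruction-generation step described above (drawing 3 seed samples from a pool of 80 human-curated instructions and prompting GPT-4 with them as few-shot examples) can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the `build_generation_prompt` helper, the prompt wording, and the example seed pool are all hypothetical; only the "sample k=3 seeds, then prompt a generator model" structure comes from the paper.

```python
import random

def build_generation_prompt(seed_pool, k=3, rng=None):
    """Draw k seed instructions from a human-curated pool and format
    them as few-shot examples for prompting a generator model
    (GPT-4 in the paper) to produce a new biomedical instruction.

    The prompt text here is a hypothetical placeholder, not the
    prompt used by the BioInstruct authors.
    """
    rng = rng or random.Random()
    seeds = rng.sample(seed_pool, k)  # sample without replacement
    examples = "\n\n".join(
        f"Instruction {i + 1}: {s}" for i, s in enumerate(seeds)
    )
    return (
        "You are creating instructions for biomedical NLP tasks.\n"
        "Here are example instructions:\n\n"
        f"{examples}\n\n"
        "Write one new, distinct biomedical instruction:"
    )

# Hypothetical stand-in for the 80 human-curated seed instructions.
pool = [
    "Summarize the patient's discharge note.",
    "Extract all drug names mentioned in the abstract.",
    "Answer the clinical question using the given evidence.",
    "Classify the relation between the two biomedical entities.",
]

prompt = build_generation_prompt(pool, k=3, rng=random.Random(0))
print(prompt)
```

In the paper this prompt would be sent to GPT-4 repeatedly, with fresh random seeds each call, and the generated instructions accumulated (and presumably deduplicated) until the 25 005-instruction dataset was reached.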
