
Fine-tuning large language models for rare disease concept normalization

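Keywords: Computer science; Normalization (sociology); Natural language processing; Fine-tuning; Sentence; Identifier; Artificial intelligence; Set (abstract data type); Language model; Programming language; Sociology; Anthropology; Physics; Quantum mechanics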
Authors
Andy Wang, Cong Liu, Jingye Yang, Chunhua Weng
Source
Journal: Journal of the American Medical Informatics Association [Oxford University Press]
Volume (issue): 31 (9): 2076-2083 Cited by: 6
Identifier
DOI: 10.1093/jamia/ocae133
Abstract

Objective: We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), using a domain-specific corpus sourced from the Human Phenotype Ontology (HPO).

Methods: We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes HPO names and half of the concept’s synonyms as well as identifiers. Subsequently, we fine-tuned Llama 2 (Llama2-7B) for each sentence set and conducted an evaluation using a range of sentence prompts and various phenotype terms.

Results: When the phenotype terms for normalization were included in the fine-tuning corpora, both models demonstrated nearly perfect performance, averaging over 99% accuracy. In comparison, ChatGPT-3.5 has only ∼20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced in the phenotype terms, the accuracy of NAME and NAME+SYN is 10.2% and 36.1%, respectively, but increases to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy.

Conclusion: Our fine-tuned models demonstrate ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and laymen’s terms. Our approach provides a solution for the use of LLMs to identify named medical entities from clinical narratives, while successfully normalizing them to standard concepts in a controlled vocabulary.
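
The corpus construction described in the Methods can be pictured with a small sketch. This is not the authors' in-house script, which the abstract does not show. The two sentence templates, the toy ontology entries, the function names, and the use of character substitution as the typo type are all assumptions made for illustration. The sketch only shows the general idea: a NAME corpus built from standard names, a NAME+SYN corpus that adds half of each concept's synonyms, a held-out synonym set for testing, and a single-character typo generator.

```python
import random

# Toy HPO-like entries for illustration only; the synonym lists are
# assumptions and are not checked against any specific HPO release.
TOY_HPO = {
    "HP:0001250": {
        "name": "Seizure",
        "synonyms": ["Epileptic seizure", "Seizures", "Fits", "Convulsions"],
    },
    "HP:0000252": {
        "name": "Microcephaly",
        "synonyms": ["Small head", "Reduced head circumference"],
    },
}

# Hypothetical templates; the paper's actual templates are not in the abstract.
TEMPLATES = [
    "The Human Phenotype Ontology term {term} is identified by the HPO ID {hpo_id}.",
    "The HPO ID of {term} is {hpo_id}.",
]


def split_synonyms(synonyms, rng):
    """Shuffle the synonyms and split them in half: (seen, held out)."""
    shuffled = list(synonyms)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def build_corpora(ontology, templates, seed=0):
    """Return (NAME sentences, NAME+SYN sentences, held-out synonyms per ID)."""
    rng = random.Random(seed)
    name_corpus, name_syn_corpus, held_out = [], [], {}
    for hpo_id, entry in ontology.items():
        seen, unseen = split_synonyms(entry["synonyms"], rng)
        held_out[hpo_id] = unseen
        for template in templates:
            sentence = template.format(term=entry["name"], hpo_id=hpo_id)
            name_corpus.append(sentence)
            name_syn_corpus.append(sentence)
            # Only the "seen" half of the synonyms enters NAME+SYN.
            for synonym in seen:
                name_syn_corpus.append(template.format(term=synonym, hpo_id=hpo_id))
    return name_corpus, name_syn_corpus, held_out


def add_single_char_typo(term, rng):
    """Replace one alphabetic character with a different lowercase letter."""
    positions = [i for i, ch in enumerate(term) if ch.isalpha()]
    if not positions:
        return term
    i = rng.choice(positions)
    choices = [c for c in "abcdefghijklmnopqrstuvwxyz" if c != term[i].lower()]
    return term[:i] + rng.choice(choices) + term[i + 1:]


if __name__ == "__main__":
    name_corpus, name_syn_corpus, held_out = build_corpora(TOY_HPO, TEMPLATES)
    print(len(name_corpus), len(name_syn_corpus))
    print(held_out)
    print(add_single_char_typo("Seizure", random.Random(1)))
```

Under these assumptions, the held-out synonyms play the role of the "unseen synonyms" in the Results, and the typo function produces perturbed terms of the kind used to probe robustness to misspellings. A real pipeline would read names and synonyms from an HPO release file rather than a hand-written dictionary.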