Enhancing Recognition and Interpretation of Functional Phenotypic Sequences through Fine-Tuning Pre-Trained Genomic Models

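Artificial intelligence, Genomics, Computer science, Sequence (biology), Computational biology, DNA sequencing, Construct (Python library), Machine learning, Genome, Biology, Genetics, Gene, Programming language
Authors
Duo Du, Lei Liu, Fan Zhong
Identifier
DOI:10.1101/2023.12.05.570173
Abstract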

Decoding high-quality human genomic sequences requires comprehensive analysis of DNA sequence functionality. Through computational and experimental approaches, researchers study the genotype-phenotype relationship and generate important datasets that help unravel complicated genetic blueprints. This study explores the use of deep learning, particularly pre-trained models such as DNA_bert_6 and human_gpt2-v1, in interpreting and representing human genome sequences. We construct multiple datasets linking genotypes and phenotypes and use them to fine-tune pre-trained models for DNA sequence classification. We focus in particular on the human endogenous retrovirus (HERV) dataset, on which classification performance is strong (both binary and multi-class classification accuracy and F1 values above 0.935 and 0.888, respectively). Using the HERV dataset, we evaluate the influence of sequence length on classification results and analyze the impact of feature extraction in the model's hidden layers. To further understand the phenotype-specific patterns learned by the model, we perform enrichment, pathogenicity and conservation analyses of specific motifs in the HERV sequence with high average local representation weight (LRAW) scores. Overall, the generated datasets provide numerous additional genotype-phenotype resources for evaluating the performance of genomic models. The findings highlight the potential of large models in learning DNA sequence representations, particularly when utilizing the HERV dataset, and provide insights for future research. This work combines pre-trained model representations with classical omics methods for analyzing the functionality of genome sequences, fostering cross-fertilization between genomics and advanced AI. The source code and data are available at https://github.com/GeorgeBGM/Genome_Fine-Tuning.
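As a rough illustration of two building blocks the abstract relies on, the sketch below shows how a DNA sequence might be split into overlapping 6-mers before being passed to a k-mer-based model, and how accuracy and a binary F1 score can be computed from predicted labels. It is a minimal sketch and not the authors' pipeline: the assumption that DNA_bert_6 consumes overlapping 6-mers with stride 1 is inferred from the model name, and the function names `kmer_tokens`, `accuracy` and `binary_f1` are hypothetical helpers introduced here for illustration only.

```python
from typing import List, Sequence


def kmer_tokens(seq: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1).

    Assumption: k = 6 matches the "6" in DNA_bert_6; the real tokenizer
    may differ in details such as special tokens or padding.
    """
    seq = seq.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 score for the positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    # Return 0.0 when there are no positive labels or predictions at all.
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy sequence and labels, made up for this example (not data from the study).
    print(kmer_tokens("ACGTACGT"))
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
    print(binary_f1([1, 0, 1, 1], [1, 0, 0, 1]))
```

For multi-class evaluation, the same F1 computation could be applied per class and averaged; which averaging scheme the study used is not stated in the abstract.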
