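Computer science
Task (project management)
Sequence (biology)
Security token
Artificial intelligence
Machine learning
Binary classification
DNA sequencing
DNA
Biology
Support vector machine
Genetics
Engineering
Computer security
Systems engineering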
Authors
Daoan Zhang, Weitong Zhang, Bing He, Jianguo Zhang, Chenchen Qin, Jianhua Yao
Source
Journal: Cornell University - arXiv
Date: 2023-01-01
Citations: 7
Identifier
DOI: 10.48550/arxiv.2307.05628
Abstract
Pre-trained large language models demonstrate potential in extracting information from DNA sequences, yet adapting to a variety of tasks and data modalities remains a challenge. To address this, we propose DNAGPT, a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals. By enhancing the classic GPT model with a binary classification task (DNA sequence order), a numerical regression task (guanine-cytosine content prediction), and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks while processing both sequence and numerical data. Our evaluation on genomic signal and region recognition, mRNA abundance regression, and artificial genome generation tasks demonstrates DNAGPT's superior performance compared to existing models designed for specific downstream tasks, benefiting from pre-training using the newly designed model structure.