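Genome
Scale (ratio)
Computer science
Tracking (education)
Coronavirus disease 2019 (COVID-19)
Evolutionary dynamics
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
Path (computing)
Dynamics (music)
Computational biology
Biology
Gene
Genetics
Geography
Programming language
Physics
Infectious disease (medical specialty)
Population
Pedagogy
Pathology
Sociology
Psychology
Medicine
Demography
Cartography
Disease
Acoustics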
Authors
Maxim Zvyagin,Alexander Brace,Kyle Hippe,Yuntian Deng,Bin Zhang,Cindy Orozco Bohorquez,Austin Clyde,Bharat Kale,Danilo Perez-Rivera,Heng Ma,Carla M. Mann,Michael Irvin,Defne G. Ozgulbas,Natalia Vassilieva,J. Gregory Pauloski,Logan Ward,Valérie Hayot-Sasson,Murali Emani,Sam Foreman,Zhen Xie
Identifier
DOI:10.1177/10943420231201154
Abstract
We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole-genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.