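Genome
Interpretability
Computer science
Hyperparameter
Computational biology
Machine learning
Artificial intelligence
Data mining
Biology
Gene
Genetics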
Authors
Mohammad Saleh Refahi, Bahrad A. Sokhansanj, Gail Rosen
Identifier
DOI:10.1109/spmb59478.2023.10372773
Abstract
Analyzing sequencing data from microbiome experiments is challenging because samples can contain tens of thousands of unique taxa (and their genes) and populations of millions of cells. Reducing the dimensionality of metagenomic data is a crucial step toward making this complex genetic information interpretable, as metagenomic datasets typically encompass a wide range of genetic diversity and variation. In this study, we implement RoBERTa, a state-of-the-art large language model, and pre-train it on relatively large genomic datasets to obtain a model whose embeddings can help simplify complex metagenomic datasets. Pre-training enables RoBERTa to capture the inherent characteristics and patterns of genomic sequences. We then evaluate the embeddings generated by the pre-trained RoBERTa model in downstream tasks, with a particular focus on taxonomic classification. To assess whether our method generalizes, we conduct extensive downstream analysis on three distinct datasets: 16S rRNA, 28S rRNA, and ITS. By using datasets containing 16S rRNA, exclusive to bacteria and eukaryotic mitochondria, as well as datasets containing 28S rRNA and ITS, specific to eukaryotes (such as fungi), we assess the performance of RoBERTa embeddings across diverse genomic regions. We tune the RoBERTa model through hyperparameter optimization on each dataset. Our results show that RoBERTa embeddings perform promisingly in taxonomic classification compared with conventional methods.
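As a rough illustration of the pre-training step described in the abstract, the sketch below shows one way a DNA read could be prepared for a RoBERTa-style masked-language-model objective. The abstract does not say how sequences are tokenized or masked, so everything here is an assumption made for illustration: the overlapping k-mer tokenizer, the value `k=6`, the 15% masking rate, and the helper names `kmer_tokenize` and `mask_tokens` are all hypothetical, and the masking is simplified to plain replacement with a `<mask>` token.

```python
import random

MASK = "<mask>"  # RoBERTa-style mask token string (used here only as a placeholder)


def kmer_tokenize(seq, k=6, stride=1):
    """Split a DNA sequence into overlapping k-mers.

    Hypothetical tokenizer: the abstract does not specify one.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with MASK for a masked-language-model objective.

    Returns (masked, labels): labels holds the original token at masked
    positions and None elsewhere, so a loss would only be computed where
    a token was hidden. Simplified: every selected token becomes MASK.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Toy 24-base read; real inputs would be reads or marker-gene sequences.
tokens = kmer_tokenize("ACGTACGTAGCTAGCTAGGATCCA", k=6)
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
```

With overlapping k-mers, a single masked token can largely be reconstructed from its neighbours, so a practical setup might mask contiguous spans or use non-overlapping tokens instead; which choice the authors made is not stated in the abstract.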