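Biology
Bottleneck
Sequence assembly
Computational biology
RNA-Seq
Pipeline (software)
Reference genome
Computer science
Workflow
DNA sequencing
Gene
Data mining
Transcriptome
Genetics
Database
Programming language
Gene expression
Embedded system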
Authors
Peng Liu, Jessica Ewald, José Héctor Gálvez, Jessica Head, Doug Crump, Guillaume Bourque, Niladri Basu, Jianguo Xia
Source
Journal: Genome Research
[Cold Spring Harbor Laboratory]
Date: 2021-03-17
Volume (issue): 31 (4): 713-720
Citations: 25
Identifiers
DOI: 10.1101/gr.269894.120
Abstract
Computational time and cost remain a major bottleneck for RNA-seq data analysis of nonmodel organisms without reference genomes. To address this challenge, we have developed Seq2Fun, a novel, all-in-one, ultrafast tool to directly perform functional quantification of RNA-seq reads without transcriptome de novo assembly. The pipeline starts with raw read quality control: sequencing error correction, removing poly(A) tails, and joining overlapped paired-end reads. It then conducts a DNA-to-protein search by translating each read into all possible amino acid fragments and subsequently identifies possible homologous sequences in a well-curated protein database. Finally, the pipeline generates several informative outputs including gene abundance tables, pathway and species hit tables, an HTML report to visualize the results, and an output of clean reads annotated with mapped genes ready for downstream analysis. Seq2Fun does not have any intermediate steps of file writing and loading, making I/O very efficient. Seq2Fun is written in C++ and can run on a personal computer with a limited number of CPUs and memory. It can process >2,000,000 reads/min and is >120 times faster than conventional workflows based on de novo assembly, while maintaining high accuracy in our various test data sets.
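The DNA-to-protein step described in the abstract, translating each read into all possible amino acid fragments, can be illustrated with a six-frame translation. The following is a minimal sketch in Python, not Seq2Fun's own C++ implementation. It assumes the standard genetic code, and the function names (`reverse_complement`, `translate`, `six_frame_fragments`) and the `min_len` parameter are hypothetical, chosen only for this illustration. How Seq2Fun itself filters fragments or searches the protein database is not specified here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumption:
# the standard table is the relevant one for this illustration).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_fragments(read, min_len=1):
    """Translate a read in all six frames and split each frame at stop codons.

    Returns the amino acid fragments of at least `min_len` residues, ordered
    by forward frames 0-2 and then reverse-complement frames 0-2.
    """
    read = read.upper()
    fragments = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            protein = translate(strand[offset:])
            fragments.extend(f for f in protein.split("*") if len(f) >= min_len)
    return fragments


if __name__ == "__main__":
    print(six_frame_fragments("ATGGCCTAA"))
```

In a real pipeline, each fragment would then be searched against a protein database, and a larger `min_len` would discard fragments too short to give a meaningful match.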