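Computer science
Barcode
Identification (biology)
Taxonomic rank
Environmental DNA
Data mining
GenBank
Pipeline (software)
Sequence (biology)
DNA barcoding
Tree (set theory)
Inference
Sampling (signal processing)
Reference
Artificial intelligence
Mathematics
Biology
Taxon
Genome
Biodiversity
Ecology
Programming language
Computer vision
Mathematical analysis
Genetics
Operating system
Filter (signal processing)
Gene
Biochemistry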
Authors
Shaun Wilkinson, Simon K. Davy, Michael Bunce, Michael Stat
Identifier
DOI:10.7287/peerj.preprints.26812v1
Abstract
High-throughput sequencing of environmental DNA (eDNA) offers a simple and cost-effective solution for marine biodiversity assessments. Yet several analytical challenges remain, including the incorporation of statistical inference in the assignment of taxonomic identities. We developed a probabilistic method for DNA barcode classification that can be used for both eDNA and traditional single-source sampling. The pipeline involves: (1) compiling a primer-specific database of barcode sequences to be used as training data (obtained from GenBank and other sequence repositories), (2) generating a classification tree using an iterative learning algorithm that divisively sorts the training data into hierarchical clusters based on profile hidden Markov models, (3) assigning each query sequence to a cluster using a recursive series of model-comparison tests, and (4) identifying each query sequence taxonomically based on the lowest common taxonomic rank of the training sequences within its cluster. This method compares favorably to other DNA classification methods when tested on benchmark datasets, and offers the added features of classifying at higher taxonomic ranks and returning interpretable confidence values in the form of the Akaike weight statistic. This bioinformatics pipeline is available as an open source R package called ‘insect’ (informatic sequence classification trees).
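Two of the ideas named in the abstract can be illustrated with a minimal sketch: the Akaike weight used as a confidence value in a model-comparison test, and the lowest common taxonomic rank shared by the training sequences in a cluster. The sketch is written in Python for illustration only and is not the ‘insect’ package's API. The function names, the AIC scores, and the example lineages are all hypothetical, and the standard textbook definition of the Akaike weight is assumed.

```python
import math
from typing import List, Sequence


def akaike_weights(aic_values: Sequence[float]) -> List[float]:
    """Convert AIC scores of competing models into Akaike weights.

    Assumes the standard definition: w_i = exp(-d_i / 2) / sum_j exp(-d_j / 2),
    where d_i is the AIC difference between model i and the best model.
    """
    best = min(aic_values)
    # Relative likelihood of each model, measured against the best-scoring one.
    relative = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


def lowest_common_taxon(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-to-tip prefix shared by every lineage."""
    common: List[str] = []
    for names_at_rank in zip(*lineages):
        # Stop at the first rank where the training sequences disagree.
        if len(set(names_at_rank)) != 1:
            break
        common.append(names_at_rank[0])
    return common


# Hypothetical AIC scores for two candidate child models at one tree node.
weights = akaike_weights([100.0, 102.0])

# Hypothetical lineages of the training sequences in the chosen cluster.
cluster_lineages = [
    ["Animalia", "Chordata", "Actinopterygii", "Labridae"],
    ["Animalia", "Chordata", "Actinopterygii", "Pomacentridae"],
]
assigned = lowest_common_taxon(cluster_lineages)
```

In this toy case the query would be reported at the class level, because the two training lineages diverge at the family rank. The larger of the two weights would serve as the confidence value for the model comparison at that node.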