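Cluster analysis
Computer science
Heuristic
Sequence (biology)
Data mining
Single-linkage clustering
Correlation clustering
Sensitivity (control systems)
CURE data clustering algorithm
Cluster (spacecraft)
Artificial intelligence
Biology
Engineering
Electronic engineering
Genetics
Programming language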
Authors
Ming Cao,Qinke Peng,Ze-Gang Wei,Fei Liu,Yifan Hou
Identifier
DOI:10.1142/s0219720021500360
Abstract
The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied to sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of the number of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method, edClust, which groups similar sequences using Edlib, a C/C++ library for fast, exact semi-global sequence alignment. edClust was tested on three large-scale sequence databases and compared with several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on cluster number and seed sensitivity (SS) demonstrate that edClust produces fewer clusters than the other methods and achieves a higher SS. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
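
To make the idea of heuristic, seed-based sequence clustering concrete, here is a minimal Python sketch of a greedy incremental scheme of the general kind used by tools such as CD-HIT and UCLUST. It is not edClust's implementation. The function names (`semiglobal_distance`, `greedy_cluster`), the identity formula (1 minus edit distance divided by query length), and the longest-first processing order are all assumptions made for illustration. A small pure-Python dynamic program stands in for Edlib so that the example has no dependencies.

```python
def semiglobal_distance(query, target):
    """Edit distance of `query` against its best-matching substring of `target`.

    Pure-Python stand-in for a semi-global (infix) aligner; gaps before and
    after the matched region of `target` are free. Illustrative only.
    """
    # Row 0 is all zeros: the alignment may start anywhere in target.
    prev = [0] * (len(target) + 1)
    for i, qc in enumerate(query, 1):
        curr = [i]  # aligning query[:i] against nothing costs i deletions
        for j, tc in enumerate(target, 1):
            curr.append(min(prev[j] + 1,                # skip a query character
                            curr[j - 1] + 1,            # skip a target character
                            prev[j - 1] + (qc != tc)))  # match or substitute
        prev = curr
    # The alignment may end anywhere in target, so take the row minimum.
    return min(prev)


def greedy_cluster(seqs, identity=0.9):
    """Greedy incremental clustering (hypothetical sketch).

    Sequences are visited longest first. Each one joins the first seed it
    matches at or above `identity`, otherwise it becomes a new seed.
    Returns a dict mapping seed index -> list of member indices.
    """
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    clusters = {}
    for i in order:
        s = seqs[i]
        for seed in clusters:
            dist = semiglobal_distance(s, seqs[seed])
            # Assumed identity definition: 1 - edits / query length.
            if 1 - dist / max(len(s), 1) >= identity:
                clusters[seed].append(i)
                break
        else:
            clusters[i] = [i]  # no seed was close enough: start a new cluster
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT", "ACGTACGT"]
    for seed, members in greedy_cluster(reads, identity=0.8).items():
        print(reads[seed], [reads[m] for m in members])
```

In this scheme each sequence is compared only against cluster seeds, which keeps the cost low but makes the result depend on input order and on how sensitively a sequence is matched to its seed. In practice the alignment step would be delegated to an optimized aligner rather than the quadratic loop shown here.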