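Locality-sensitive hashing
Computer science
Nearest neighbor search
k-nearest neighbors algorithm
Hash function
Cluster analysis
Data mining
Nearest-neighbor chain algorithm
Sequence (biology)
Cluster (spacecraft)
Best bin first
Theoretical computer science
Hash table
Artificial intelligence
Biology
Correlation clustering
Computer security
Canopy clustering algorithm
Genetics
Programming language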
Authors
Souha S. Kanj,Thomas Brüls,Stéphane Gazut
Abstract
We present a new algorithm for clustering high-dimensional sequence data and apply it to metagenomics, which aims to reconstruct individual genomes from a mixture of genomes sampled from an environmental site, without any prior knowledge of reference data (genomes) or the shape of clusters. Such problems typically cannot be solved directly with classical approaches that seek to estimate the density of clusters, e.g., using the shared nearest neighbors rule, because contemporary sequence datasets are prohibitively large. We explore a new method that combines the shared nearest neighbor (SNN) rule with Locality Sensitive Hashing (LSH). The proposed method, called LSH-SNN, works by randomly splitting the input data into smaller subsets (buckets) and applying the shared nearest neighbor rule to each bucket. Links are created among neighbors that share a sufficient number of elements, which allows clusters to be grown from linked elements. LSH-SNN scales to larger datasets consisting of millions of sequences while achieving high accuracy across a variety of sample sizes and complexities.
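The bucket-then-link idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: sequences are represented as numeric feature vectors, the hash family is random-hyperplane (sign of random projection) hashing, neighbors are linked when they are mutual k-nearest neighbors sharing at least `min_shared` neighbors, and clusters are grown with a union-find structure. The function names and the parameters `n_planes`, `k`, and `min_shared` are hypothetical.

```python
import random
from collections import defaultdict


def lsh_buckets(points, n_planes, seed=0):
    """Group point indices by the sign pattern of random projections."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumption: random-hyperplane hashing; the abstract names no hash family.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(
            sum(w * x for w, x in zip(plane, point)) >= 0.0 for plane in planes
        )
        buckets[key].append(idx)
    return list(buckets.values())


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_sets(points, bucket, k):
    """For each index in the bucket, the set of its k nearest bucket members."""
    neighbors = {}
    for i in bucket:
        ranked = sorted(
            (j for j in bucket if j != i),
            key=lambda j: sq_dist(points[i], points[j]),
        )
        neighbors[i] = set(ranked[:k])
    return neighbors


def lsh_snn(points, n_planes=2, k=3, min_shared=2, seed=0):
    """Return one cluster label per point (labels are union-find roots)."""
    parent = list(range(len(points)))

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for bucket in lsh_buckets(points, n_planes, seed):
        nn = knn_sets(points, bucket, k)
        for pos, a in enumerate(bucket):
            for b in bucket[pos + 1:]:
                # Assumed linking rule: mutual neighbors with enough shared neighbors.
                mutual = b in nn[a] and a in nn[b]
                if mutual and len(nn[a] & nn[b]) >= min_shared:
                    parent[find(a)] = find(b)
    return [find(i) for i in range(len(points))]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    print(lsh_snn(data, n_planes=1, k=3, min_shared=2))
```

In this single-table sketch, points that hash to different buckets are never compared, so a cluster that straddles a hyperplane would be split; how the published method handles that case is not stated in the abstract.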