局部敏感散列
最近邻搜索
相似性(几何)
计算机科学
修剪
搜索引擎索引
散列函数
数据挖掘
贝叶斯概率
概率逻辑
模式识别(心理学)
精确性和召回率
人工智能
航程(航空)
哈希表
算法
图像(数学)
生物
计算机安全
复合材料
材料科学
农学
作者
Venu Satuluri,Srinivasan Parthasarathy
出处
期刊:Cornell University - arXiv
日期:2011-01-01
标识
DOI:10.48550/arxiv.1110.1328
摘要
Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. Locality-sensitive hashing (LSH) based methods have become a very popular approach for this problem. However, most such methods only use LSH for the first phase of similarity search - i.e. efficient indexing for candidate generation. In this paper, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search - performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. BayesLSH is able to quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, both in terms of accuracy and recall. Finally, the quality of BayesLSH's output can be easily tuned and does not require any manual setting of the number of hashes to use for similarity estimation, unlike standard approaches. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range 2x-20x for a wide variety of datasets.
科研通智能强力驱动
Strongly Powered by AbleSci AI