素描
计算机科学
散列函数
可扩展性
数据流
数据流挖掘
数据挖掘
数据结构
比例(比率)
大方坯过滤器
哈希表
理论计算机科学
动态完美哈希
概率逻辑
情报检索
数据库
算法
人工智能
双重哈希
物理
程序设计语言
电信
量子力学
计算机安全
作者
Yongqiang Liu,Xike Xie
标识
DOI:10.1145/3442381.3449984
摘要
Conventional sketching methods on counting stream item frequencies use hash functions for mapping data items to a concise structure, e.g., a two-dimensional array, at the expense of overcounting due to hashing collisions. Despite the popularity, however, the accumulated errors originated in hashing collisions deteriorate the sketching accuracies at the rapid pace of data increasing, which poses a great challenge to sketch big data streams at web scale. In this paper, we propose a novel structure, called XY-sketch, which estimates the frequency of a data item by estimating the probability of this item appearing in the data stream. The framework associated with XY-sketch consists of two phases, namely decomposition and recomposition phases. A data item is split into a set of compactly stored basic elements, which can be stringed up in a probabilistic manner for query evaluation during the recomposition phase. Throughout, we conduct optimization under space constraints and detailed theoretical analysis. Experiments on both real and synthetic datasets are done to show the superior scalability on sketching large-scale streams. Remarkably, XY-sketch is orders of magnitudes more accurate than existing solutions, when the space budget is small.
科研通智能强力驱动
Strongly Powered by AbleSci AI