Groundhog day

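Computer science; Microblogging; Information retrieval; Social media; Process (computing); Quality (philosophy); Semantic similarity; Similarity (geometry); World Wide Web; Data mining; Artificial intelligence; Epistemology; Operating system; Image (mathematics); Philosophy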
Authors
Ke Tao, Fabian Abel, Claudia Hauff, Geert‐Jan Houben, Ujwal Gadiraju
Source
Venue: The Web Conference  Cited by: 49
Identifier
DOI:10.1145/2488388.2488499
Abstract

With more than 340 million messages posted on Twitter every day, the amount of duplicate content, as well as the demand for appropriate duplicate detection mechanisms, is increasing tremendously. Yet little research aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-duplicate detection on Twitter and introduce a framework that analyzes tweets by comparing (i) syntactical characteristics, (ii) semantic similarity, and (iii) contextual information. Our framework provides different duplicate detection strategies that, among others, make use of external Web resources referenced from microposts. Machine learning is exploited to learn patterns that help identify duplicate content. We put our duplicate detection framework into practice by integrating it into Twinder, a search engine for Twitter streams. An in-depth analysis shows that it allows Twinder to diversify search results and improve the quality of Twitter search. We conduct extensive experiments in which we (1) evaluate the quality of different strategies for detecting duplicates, (2) analyze the impact of various features on duplicate detection, (3) investigate the quality of strategies that classify to what exact level two microposts can be considered duplicates, and (4) optimize the process of identifying duplicate content on Twitter. Our results prove that semantic features extracted by our framework can boost the performance of detecting duplicates.
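
The abstract names syntactical characteristics as one of the signals compared between two tweets but does not list the concrete features. The following is a minimal sketch of what such pairwise features might look like, not the authors' implementation: the function names, the choice of word-level Jaccard overlap, normalized edit distance, and a shared-URL flag, and the tokenization rule are all assumptions made for illustration. In a setup like the one described, a feature vector of this kind could be passed to a learned classifier.

```python
import re

# Assumption: a URL is anything starting with http:// or https:// up to whitespace.
URL_RE = re.compile(r"https?://\S+")


def tokens(text):
    """Return the set of lowercase word tokens, with URLs stripped out."""
    without_urls = URL_RE.sub(" ", text.lower())
    return set(re.findall(r"[a-z0-9#@']+", without_urls))


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


def pair_features(tweet_a, tweet_b):
    """Hypothetical syntactic feature vector for a pair of tweets."""
    urls_a = set(URL_RE.findall(tweet_a))
    urls_b = set(URL_RE.findall(tweet_b))
    max_len = max(len(tweet_a), len(tweet_b)) or 1
    return {
        "word_overlap": jaccard(tokens(tweet_a), tokens(tweet_b)),
        "edit_similarity": 1 - levenshtein(tweet_a, tweet_b) / max_len,
        "shared_url": float(bool(urls_a & urls_b)),
    }


features = pair_features(
    "Breaking: fire downtown http://t.co/abc",
    "breaking: fire downtown http://t.co/abc",
)
```

Semantic and contextual signals, which the abstract reports as helpful, would need additional resources (for example, the content of linked Web pages) and are outside this sketch.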