A Comparative Study on Embedding Models for Keyword Extraction Using KeyBERT Method

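Computer science; Sentence; Transformer; Keyword extraction; Embedding; Cosine similarity; Natural language processing; Paraphrase; Benchmark; Artificial intelligence; Information retrieval; Data mining; Pattern recognition; Geodesy; Geography; Quantum mechanics; Physics; Voltage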
Authors
Bayan Issa, Muhammed Basheer Jasser, Hui Na Chua, Muzaffar Hamzah
Identifier
DOI:10.1109/icset59111.2023.10295108
Abstract

KeyBERT is a keyword/keyphrase extraction method with three steps. The first step selects candidate keywords from a text using the sklearn library. The second step embeds the text and its candidate keywords; this is done with BERT to obtain numerical representations that capture their meaning. The third step calculates the cosine similarity between each candidate keyword vector and the document vector. In this paper, we focus on the second step of KeyBERT (the embedding step). Although KeyBERT supports many models for the embedding operation, no extensive comparative study has analyzed the effect of using different supported models in KeyBERT. We present a comparative study of two commonly used groups of models: the first group consists of sentence-transformers pretrained models, supported via the sentence-transformers library, and the second group includes the Longformer model, supported via the Hugging Face Transformers library. We conduct the study on benchmark datasets containing English text documents from multiple domains and of different lengths. We found that the Paraphrase-mpnet-base-v2 model provides the best results among all models in terms of effectiveness (F1-score, recall, precision, MAP) on all datasets, and that it is more efficient (faster) on short text than on long text; accordingly, we recommend using it for short text. On the other hand, the Longformer model is the most efficient (fastest) among all models on all datasets, and this advantage is especially evident on long text; accordingly, we recommend using it for long text.
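The three steps described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the KeyBERT implementation: candidate selection is a simplified stand-in for the sklearn-based step, and `embed` is a toy bag-of-words count vector used in place of a BERT embedding, so the scores only illustrate the ranking logic. The names `STOP_WORDS`, `tokenize`, `candidate_phrases`, `embed`, and `extract_keywords` are hypothetical and chosen for this example.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "for", "on", "with", "by"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOP_WORDS]


def candidate_phrases(text, ngram_range=(1, 2)):
    """Step 1 (simplified): collect word n-grams as candidate keywords/keyphrases."""
    tokens = tokenize(text)
    low, high = ngram_range
    candidates = set()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            candidates.add(" ".join(tokens[i:i + n]))
    return sorted(candidates)


def embed(text):
    """Step 2 (stand-in): a bag-of-words count vector, NOT a BERT embedding."""
    return Counter(tokenize(text))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_keywords(text, top_n=5, ngram_range=(1, 2)):
    """Step 3: rank candidates by similarity to the whole-document vector."""
    doc_vec = embed(text)
    scored = [
        (cand, cosine_similarity(embed(cand), doc_vec))
        for cand in candidate_phrases(text, ngram_range)
    ]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


if __name__ == "__main__":
    doc = (
        "Keyword extraction with embeddings. "
        "Keyword extraction ranks candidate keywords by cosine similarity."
    )
    for phrase, score in extract_keywords(doc, top_n=3):
        print(f"{phrase}\t{score:.3f}")
```

In this sketch the embedding step is isolated in `embed`, which is the step the study varies. Replacing it with dense vectors from a pretrained model (for example via the sentence-transformers library) would presumably bring the sketch closer to what the paper compares, but that substitution is an assumption and is not shown here.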