Computer Science
Sentence
Transformer
Keyword extraction
Embedding
Cosine similarity
Natural language processing
Paraphrase
Benchmark
Artificial intelligence
Information retrieval
Data mining
Pattern recognition
Authors
Bayan Issa, Muhammed Basheer Jasser, Hui Na Chua, Muzaffar Hamzah
Identifier
DOI: 10.1109/icset59111.2023.10295108
Abstract
KeyBERT is a keyword/keyphrase extraction method with three steps. First, candidate keywords are selected from the text using the scikit-learn library. Second, the text and its candidate keywords are embedded; this step uses BERT to produce numerical representations that capture their meanings. Third, the cosine similarity between each candidate keyword vector and the document vector is computed. In this paper, we focus on the second step of KeyBERT (the embedding step). Although KeyBERT supports many models for the embedding operation, no extensive comparative study has analyzed the effect of using the different supported models in KeyBERT. We introduce a comparative study of two commonly used groups of models: the first group consists of pretrained sentence-transformers models, supported via the sentence-transformers library, and the second group includes the Longformer model, supported via the Hugging Face Transformers library. We evaluate the models on benchmark datasets containing English text documents from multiple domains and of different text lengths. Based on the study, we found that the paraphrase-mpnet-base-v2 model gives the best keyword-extraction effectiveness (F1-score, recall, precision, MAP) among all models on all datasets, and is more efficient (in time) on short text than on long text; accordingly, we recommend using it on short documents. The Longformer model, on the other hand, is the fastest of all models on all datasets, and its advantage is especially evident on long text; accordingly, we recommend using it on long documents.
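To make the three steps concrete, the following is a minimal, self-contained sketch of the pipeline the abstract describes (candidate selection with scikit-learn, embedding with a sentence-transformers model, ranking by cosine similarity). It is an illustration, not KeyBERT's actual implementation: the document string, the (1, 2) n-gram range, and the top-5 cutoff are our own choices for the example.

    # Sketch of the KeyBERT-style pipeline described in the abstract.
    # Assumes scikit-learn and sentence-transformers are installed.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sentence_transformers import SentenceTransformer

    doc = ("KeyBERT extracts keywords by comparing the embeddings of "
           "candidate keyphrases with the embedding of the whole document.")

    # Step 1: select candidate keywords/keyphrases with scikit-learn.
    vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit([doc])
    candidates = list(vectorizer.get_feature_names_out())

    # Step 2: embed the document and each candidate with a supported model
    # (here paraphrase-mpnet-base-v2, the best-performing model in this study).
    model = SentenceTransformer("paraphrase-mpnet-base-v2")
    doc_embedding = model.encode([doc])           # shape (1, dim)
    candidate_embeddings = model.encode(candidates)  # shape (n, dim)

    # Step 3: rank candidates by cosine similarity to the document vector.
    scores = cosine_similarity(doc_embedding, candidate_embeddings)[0]
    top_keywords = [candidates[i] for i in scores.argsort()[::-1][:5]]
    print(top_keywords)

In practice, the KeyBERT package wraps all three steps; in recent versions a call along the lines of KeyBERT(model="paraphrase-mpnet-base-v2").extract_keywords(doc) selects the embedding model by name, and a Hugging Face feature-extraction pipeline (e.g., over allenai/longformer-base-4096) can be passed as the model instead, which matches the paper's recommendation for long documents.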