Computer science
Divergence-from-randomness model
Normalization (sociology)
Term discrimination
Probabilistic logic
Artificial intelligence
Term (time)
Language model
Natural language processing
Embedding
Sentence
Vector space model
Information retrieval
Visual words
Anthropology
Quantum mechanics
Physics
Sociology
Image retrieval
Image (mathematics)
Authors
Fanghong Jian,Jimmy Xiangji Huang,Jiashu Zhao,Zhiwei Ying,Yuqi Wang
Abstract
Many well-known probabilistic information retrieval models have shown promise for document ranking, especially BM25. Nevertheless, the control parameters in BM25 usually need to be tuned to achieve good performance on different data sets; in addition, BM25's bag-of-words assumption prevents it from directly exploiting rich information at the sentence or document level. Motivated by these challenges, we first propose a new normalization method for the term frequency in BM25 (called BM25QL in this paper); this method is also incorporated into CRTER2, a recent BM25-based model, to construct CRTER2QL. We then incorporate topic modeling and word embeddings into BM25 to relax the bag-of-words assumption. In this direction, we propose a topic-based retrieval model, TopTF, for BM25, which is further incorporated into the language model (LM) and the multiple aspect term frequency (MATF) model. Furthermore, an enhanced topic-based term frequency normalization framework, ETopTF, based on embeddings is presented. Experimental studies demonstrate the effectiveness of these methods. Specifically, on all tested data sets and in terms of mean average precision (MAP), our proposed models BM25QL and CRTER2QL are comparable to BM25 and CRTER2 with the best b parameter value; the TopTF models significantly outperform the baselines, and the ETopTF models further improve on TopTF in terms of MAP.
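The abstract's central concern is the b parameter, which in standard BM25 controls how strongly term frequency is normalized by document length. The sketch below is a minimal illustration of that classic BM25 term-weighting formula (not the paper's proposed BM25QL normalization, which is not specified here); all function and variable names are illustrative assumptions.

```python
import math

def bm25_term_score(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """Score contribution of one query term in one document under classic BM25.

    tf: term frequency in the document; df: document frequency of the term;
    N: number of documents in the collection; dl: document length;
    avgdl: average document length in the collection.
    b in [0, 1] controls length normalization: b=0 disables it,
    b=1 applies full normalization by dl/avgdl.
    """
    # Robertson-Sparck Jones style IDF (the +1.0 keeps it non-negative)
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    # Saturating, length-normalized term frequency
    norm_tf = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * dl / avgdl))
    return idf * norm_tf
```

For a document longer than average (dl > avgdl), raising b lowers the score, which is why the best b value varies across collections, as the abstract notes:

```python
# Longer-than-average document: full normalization (b=1) penalizes it
# relative to no normalization (b=0).
low_b = bm25_term_score(tf=3, df=10, N=1000, dl=200, avgdl=100, b=0.0)
high_b = bm25_term_score(tf=3, df=10, N=1000, dl=200, avgdl=100, b=1.0)
assert high_b < low_b
```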