Authors
Zhongyi Wang,Haoxuan Zhang,Jiangping Chen,Haihua Chen
Abstract
The ex-ante novelty measurement of scientific literature is an essential tool for academic data mining and scientific communication: it helps researchers and peer experts quickly identify highly creative articles among large numbers of papers. This paper proposes a framework for measuring the novelty of scientific literature based on contribution sentence analysis. In the first part of the framework, to obtain the best models for contribution sentence identification and classification, we implement eight state-of-the-art deep learning models and compare their performance on each task. The selected contribution sentence identification model achieves the best recall and F1 scores (0.963 and 0.929, respectively), and the best contribution sentence classification model achieves a Micro F1 score of 0.897. In the second part, to represent each contribution sentence, we generate a contribution sentence cloud using the BERTopic model and the backward normal cloud generator. In the third part, we calculate novelty scores for scientific literature using a cloud similarity algorithm. Finally, against a manually constructed gold standard, we perform three comparative experiments with a semantic novelty measurement on the International Conference on Learning Representations (ICLR 2017-2022) dataset. In the correlation analysis, our measurement has a larger correlation coefficient with the gold standard than the semantic novelty measurement (0.805 > 0.580) at a p-value below 0.0001. In the distribution of differences from the gold standard, our measurement has 2,584 (79.2%) articles falling within the range of ±1.5, compared with 1,519 (46.6%) articles for the semantic novelty measurement. The boxplots likewise show that the results of our measurement are closer to the gold standard than those of the semantic novelty measurement.
These experimental results show that our measurement is more feasible and effective than the semantic novelty measurement. Our framework benefits several communities, including researchers, librarians, science evaluation institutions, policymakers, and funding agencies.
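The backward normal cloud generator mentioned in the abstract has a standard formulation in cloud model theory: from a sample it estimates three digital characteristics, expectation (Ex), entropy (En), and hyper-entropy (He). A minimal sketch of that estimator is given below; the accompanying similarity function is only an illustrative assumption (cosine similarity of the characteristic vectors), not the exact cloud similarity algorithm used in the paper.

```python
import numpy as np

def backward_normal_cloud(samples):
    """Estimate the cloud digital characteristics (Ex, En, He) from samples,
    using the standard backward normal cloud generator formulas."""
    x = np.asarray(samples, dtype=float)
    ex = x.mean()                                      # expectation Ex
    en = np.sqrt(np.pi / 2) * np.abs(x - ex).mean()    # entropy En
    he = np.sqrt(np.abs(x.var(ddof=1) - en ** 2))      # hyper-entropy He
    return ex, en, he

def cloud_similarity(c1, c2):
    """Illustrative similarity between two clouds: cosine similarity of
    their (Ex, En, He) vectors. The paper's metric may differ."""
    a, b = np.asarray(c1, dtype=float), np.asarray(c2, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cloud = backward_normal_cloud([1, 2, 3, 4, 5])
```

An identical pair of clouds yields a similarity of 1.0 under this metric, so lower scores would indicate greater divergence between a paper's contribution sentences and prior work.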