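Key (lock)
Identification (biology)
Exploratory research
Cluster (spacecraft)
Computer science
Linguistics
Sociology
Philosophy
Social science
Computer security
Biology
Plant
Programming language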
Abstract
Various methods have been developed for identifying keywords and key clusters. Most of these methods use a reference corpus to identify keywords or key clusters in the target corpus, although a few studies have identified them without a reference corpus. However, little research appears to have compared the effectiveness of these methods, especially for identifying key clusters, a newer concept than keywords. To address this research gap, this study compares the accuracy and effectiveness of the following five methods in identifying key clusters in a corpus of Charles Dickens’s novels without the use of a reference corpus: TF (Term Frequency, a common frequency measure); DPnorm (Deviation of Proportions normalized, a robust and effective dispersion measure); PPMI (Positive Pointwise Mutual Information, a widely used association strength measure); TF-IDF (Term Frequency-Inverse Document Frequency, a blended method that considers both term frequency and inverse document frequency); and TF-DPnorm (Term Frequency-DP normalized, a self-developed blended method that factors in both frequency and normalized dispersion). With the top key clusters that Mahlberg (2007) identified in the same corpus of Dickens’s novels as the benchmark, the comparison shows that, of the five methods, the self-developed TF-DPnorm method and the TF method are the most accurate and effective in identifying key clusters in literary texts when no reference corpus is used. Reasons for the differences across the methods are explored, and research implications are discussed.
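To make the reference-corpus-free measures concrete, the following is a minimal Python sketch, not the study's implementation. It scores n-word clusters by raw TF, by DPnorm in its commonly cited form (half the summed absolute difference between observed and expected proportions per text, divided by one minus the smallest text share), and by one common TF-IDF variant (raw count times log(N/df)). The function names `ngrams` and `cluster_scores`, the whitespace tokenization, and the choice of treating each novel as one corpus part are assumptions for illustration. PPMI and TF-DPnorm are left out because the abstract does not state how they are computed for clusters.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return all contiguous n-word clusters in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cluster_scores(texts, n=2):
    """Score each n-word cluster by TF, DPnorm and TF-IDF.

    Assumes at least two non-empty texts; each text is one corpus part.
    Tokenization is naive lowercase whitespace splitting (an assumption).
    """
    per_text = [Counter(ngrams(t.lower().split(), n)) for t in texts]
    sizes = [sum(c.values()) for c in per_text]
    total = sum(sizes)
    shares = [s / total for s in sizes]  # expected proportion per text

    tf = Counter()
    for c in per_text:
        tf.update(c)

    scores = {}
    for cluster, freq in tf.items():
        # DP: half the summed gap between observed and expected proportions.
        dp = 0.5 * sum(abs(c[cluster] / freq - s)
                       for c, s in zip(per_text, shares))
        # Normalize so the maximum attainable value is 1.
        dp_norm = dp / (1 - min(shares))
        # Document frequency: number of texts containing the cluster.
        df = sum(1 for c in per_text if cluster in c)
        tf_idf = freq * math.log(len(texts) / df)
        scores[cluster] = {"tf": freq, "dp_norm": dp_norm, "tf_idf": tf_idf}
    return scores


if __name__ == "__main__":
    demo = cluster_scores(["a b a b", "a b c d"], n=2)
    for cluster, s in sorted(demo.items(), key=lambda kv: -kv[1]["tf"]):
        print(" ".join(cluster), s)
```

In this sketch a lower DPnorm means the cluster is spread more evenly across the texts, while TF-IDF drops to zero for any cluster that occurs in every text, which is one way the measures can rank the same cluster very differently.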