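Computer science
Preprocessor
Vocabulary
Data science
Natural language processing
Text mining
Information retrieval
Transparency (behavior)
Data preprocessing
Set (abstract data type)
Data mining
Artificial intelligence
Linguistics
Computer security
Philosophy
Programming language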
Authors
Louis Hickman,Stuti Thapa,Louis Tay,Mengyang Cao,Padmini Srinivasan
Identifier
DOI:10.1177/1094428120971683
Abstract
Recent advances in text mining have provided new methods for capitalizing on the voluminous natural language text data created by organizations, their employees, and their customers. Although often overlooked, decisions made during text preprocessing affect whether the content and/or style of language are captured, the statistical power of subsequent analyses, and the validity of insights derived from text mining. Past methodological articles have described the general process of obtaining and analyzing text data, but recommendations for preprocessing text data were inconsistent. Furthermore, primary studies use and report different preprocessing techniques. To address this, we conduct two complementary reviews of computational linguistics and organizational text mining research to provide empirically grounded text preprocessing decision-making recommendations that account for the type of text mining conducted (i.e., open or closed vocabulary), the research question under investigation, and the data set’s characteristics (i.e., corpus size and average document length). Notably, deviations from these recommendations will be appropriate and, at times, necessary due to the unique characteristics of one’s text data. We also provide recommendations for reporting text mining to promote transparency and reproducibility.
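The abstract names corpus size and average document length as the data set characteristics that should inform preprocessing decisions. As a minimal sketch of how those two quantities might be computed, the Python below assumes that corpus size means the number of documents and that length is counted in whitespace-separated tokens; neither definition comes from the abstract, and the function name and sample documents are hypothetical.

```python
def corpus_characteristics(documents):
    """Return (corpus_size, average_document_length) for a list of strings.

    Assumptions (not from the abstract): corpus size is the number of
    documents, and a token is any whitespace-separated string.
    """
    token_counts = [len(doc.split()) for doc in documents]
    corpus_size = len(documents)
    # Guard against an empty corpus to avoid division by zero.
    average_length = sum(token_counts) / corpus_size if corpus_size else 0.0
    return corpus_size, average_length


# Hypothetical example documents, for illustration only.
docs = [
    "The team met its quarterly goals.",
    "Customers praised the new support portal and its faster response times.",
]
size, avg_len = corpus_characteristics(docs)
print(size, avg_len)
```

A different tokenizer, or counting corpus size in total tokens, would yield different values, so the counting rule is worth reporting alongside the numbers.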