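Computer science
Automatic summarization
Artificial intelligence
Natural language processing
Readability
Newspaper
Information extraction
Named-entity recognition
Task (project management)
Sentiment analysis
Information retrieval
Management
Advertising
Economics
Business
Programming language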
Authors
Vlad Cristian Dumitru,Denis Iorga,Ştefan Ruşeţi,Mihai Dascălu
Identifier
DOI:10.1109/cscs59211.2023.00070
Abstract
Technological advancement has significantly facilitated the research and development of Artificial Intelligence, with particular emphasis on Natural Language Processing (NLP). High-quality data is crucial to achieving success in this area. This aspect becomes particularly important considering the recent widespread adoption of large language models trained on a considerable amount of text from the Internet. This research expands on the issue of data quality in NLP by examining the impact of automated text extraction techniques from HTML on the performance of specific NLP tasks. For this purpose, an empirical evaluation was conducted to assess the efficacy of various automated techniques for HTML text extraction using 300 news articles written in English, Romanian, and French. The evaluation was conducted by comparing the results of the most popular automated text extraction technologies (i.e., “boiler”, “justext”, “newspaper”, “readability”, and “trafilatura”) against the results of human-validated texts. Both extracted texts, automated and human-validated, were subjected to three NLP tasks: named entity recognition, sentiment analysis, and text summarization. Our analysis of the NLP results indicates that text from Romanian online news articles should be extracted with “newspaper”, whereas “trafilatura” should be used for English and French articles, regardless of the NLP task. Overall, our study provides a comprehensive understanding of the performance of the selected technologies for extracting the text of online news articles by language and NLP task.
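The comparison described in the abstract, automated extractor output checked against human-validated text, can be illustrated with a minimal sketch. This is not the authors' evaluation: the study compares the results of downstream NLP tasks, whereas the sketch below uses a simple token-overlap F1 score as a stand-in. The function names (`token_f1`, `rank_extractors`), the tool labels, and the sample strings are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase word tokens; \w is Unicode-aware in Python 3, so
    # Romanian and French diacritics are kept inside tokens.
    return re.findall(r"\w+", text.lower())


def token_f1(extracted, reference):
    # Token-overlap F1 between an extractor's output and the
    # human-validated reference text (assumed proxy metric,
    # not the metric used in the study).
    ext = Counter(tokenize(extracted))
    ref = Counter(tokenize(reference))
    overlap = sum((ext & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rank_extractors(outputs, reference):
    # outputs maps a (hypothetical) tool label to its extracted text.
    scores = {name: token_f1(text, reference) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    reference = "The mayor opened the new bridge"
    outputs = {
        "tool_a": "The mayor opened the new bridge",
        "tool_b": "Subscribe now The mayor opened the new bridge Share",
    }
    for name, score in rank_extractors(outputs, reference):
        print(f"{name}: {score:.2f}")
```

In this toy case the output that carries leftover page boilerplate ("Subscribe now", "Share") scores lower because its precision drops while recall stays at 1. In a setup closer to the study, the same ranking loop would compare named entities, sentiment labels, or summaries produced from each text rather than raw tokens.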