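Readability
Computer science
Natural language processing
Vietnamese
Artificial intelligence
Feature (linguistics)
Task (project management)
Semantics (computer science)
Focus (optics)
Benchmarking
Comprehension
Domain (mathematical analysis)
Information retrieval
Linguistics
Economics
Marketing
Business
Management
Philosophy
Programming language
Mathematical analysis
Physics
Mathematics
Optics
Authors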
Nam-Thuan Doan, Thi-Anh-Thi Le, An-Vinh Lương, Điền Đinh
Identifier
DOI:10.1145/3548636.3548643
Abstract
With the rapid development of text processing, readability assessment, the task of measuring how easy or difficult a text is to read, has become important and challenging. The task is well established for high-resource languages such as English, which have many NLP tools and corpora, but it remains difficult for low-resource languages, especially Vietnamese. Most previous studies of Vietnamese text readability assessment focus on shallow text characteristics and have yet to address deeper readability features. In this study, we propose a novel way to construct semantic features for Vietnamese. We observe that the difficulty level of the terms in a text affects the difficulty level of the knowledge it conveys, which is strongly involved in text comprehension. In particular, our approach is based on the difficulty distribution of terms in a text, generated by the Latent Semantic Analysis (LSA) technique, and it reduces the dependence on experts for annotating and discovering typical features in a narrow domain. The proposed feature can therefore serve as a new, automatically derived feature for Vietnamese text readability assessment. Furthermore, LSA is a statistical approach, which makes it more stable and feasible for low-resource languages. We also integrate PhoBERT, a pre-trained language model for Vietnamese, to generate bidirectional contextual word representations for long Vietnamese sequences as a semantic feature. In experiments on a Vietnamese readability dataset, our approach achieves promising performance against strong baselines, with the best result reaching an accuracy of 94.52% and a weighted F1 score of 94.09%.