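Segmentation
Parsing
Text segmentation
Computer science
Word (group theory)
Artificial intelligence
Natural language processing
Pattern recognition (psychology)
Scale-space segmentation
Image segmentation
Linguistics
Philosophy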
Authors
Shinya Matsushita,Haruhiko Takase,Toshiaki Takano,Katsuko Tomotsugu
Identifier
DOI:10.1145/3638209.3638221
Abstract
This study addresses text analysis for the preservation of minority languages. Text analysis comprises several steps, such as parsing, but word segmentation must come before these more advanced steps, so we focus on word segmentation as the first step. For segmenting minority-language text without prior knowledge, we consider NPYLM, an unsupervised method, to be effective. However, when the training text is insufficient, NPYLM produces meaningless segmentation, an error in which a single word is split into pieces. In this article, we propose a method to reduce meaningless segmentation. The basic idea is that meaningless segmentation is caused by inappropriate or insufficient growth of words, and that this growth can be controlled by selecting which words to replace. In simple experiments, the proportion of correct words among the obtained word groups improved from about 25% with vanilla NPYLM to about 50% with the proposed method. Thus, the proposed method suppresses meaningless segmentation.
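The abstract reports results as the proportion of correct words among the obtained word groups but does not spell out how that proportion is computed. As a rough illustration only, the sketch below assumes it is the fraction of distinct obtained words that also appear in a reference lexicon. The function name `correct_word_proportion` and the toy English data are hypothetical and are not taken from the paper.

```python
def correct_word_proportion(obtained_words, reference_words):
    """Fraction of distinct obtained words found in the reference lexicon.

    Assumption: this definition is a guess at the metric named in the
    abstract, not the authors' stated procedure.
    """
    obtained = set(obtained_words)
    if not obtained:
        return 0.0
    reference = set(reference_words)
    return len(obtained & reference) / len(obtained)


# Toy data (invented for illustration): "cat" and "mat" are also split
# into fragments, which mimics meaningless segmentation.
reference = {"the", "cat", "sat", "on", "mat"}
obtained = ["the", "ca", "t", "sat", "on", "m", "at", "mat"]

score = correct_word_proportion(obtained, reference)
print(f"{score:.2f}")
```

In this toy case, four of the eight distinct obtained words are in the reference lexicon, and fragments such as "ca" and "at" count against the score.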