Text classification with improved word embedding and adaptive segmentation

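Word2vec; Computer science; Word (group theory); Artificial intelligence; Word embedding; Embedding; Text segmentation; Segmentation; Sequence (biology); Set (abstract data type); Pattern recognition (psychology); Filter (signal processing); Speech recognition; Mathematics; Biology; Genetics; Programming language; Computer vision; Geometry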

Authors

Guoying Sun, Yanan Cheng, Zhaoxin Zhang, Xiaojun Tong, Tingting Chai

Source

Journal: Expert Systems With Applications [Elsevier BV]
Date: 2024-03-01 Volume/Issue: 238: 121852-121852 Citations: 2

Identifiers

DOI: 10.1016/j.eswa.2023.121852

Abstract

Text classification first requires converting text into embedding vectors. Static word embedding models such as Word2vec ignore the position of a word and the difference in its role across documents, while dynamic word embedding models such as BERT consume a large amount of time. To address this, an improved word embedding model based on pre-trained Word2vec is proposed, which achieves better classification accuracy and much lower classification time than BERT. First, the concept of Term Document Frequency (TDF) is proposed on the basis of TF-IDF, and the TF-IDF-TDF of each word in different documents is calculated. Then, positional encoding is added. Finally, to reduce the misleading effect of words with low importance, a filter is designed that sets embedding vectors with low importance to zero. Because the sequence length that a deep learning model can handle is limited, and any text sequence exceeding the Maximum Sequence Length (MSL) set by the model is truncated and discarded, an adaptive segmentation model is proposed that sets different segmentation strategies for different texts according to the length of the text and the MSL. To maintain the continuity of adjacent text after segmentation, an adjacent-segment-vector-attended co-attention network is designed. In addition, multi-channel convolution and a capsule network are designed to further extract deep hidden features. Multiple comparative experiments show that the proposed model achieves the best Accuracy and Micro-F1 on five long text baseline datasets and six short text baseline datasets. Moreover, provided the MSL is not set too large compared with the document lengths in the dataset, the classification results of the proposed model are not affected by it.
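
The weighting, filtering, and segmentation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define the TDF term, the filter threshold, or the segmentation strategies, so the sketch substitutes plain smoothed TF-IDF for TF-IDF-TDF, assumes that each word vector is scaled by its weight before filtering, and assumes a simple strategy that splits a text into the fewest near-equal segments that fit within the MSL. All function names (tfidf_weights, filter_embeddings, segment) and the threshold parameter are hypothetical. Positional encoding, the co-attention network, and the capsule network are not covered.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Per-document smoothed TF-IDF weight for each token.

    Stand-in for the paper's TF-IDF-TDF, whose TDF term is not
    defined in the abstract.
    """
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (cnt / len(doc)) * math.log((1 + n_docs) / (1 + doc_freq[tok]))
            for tok, cnt in counts.items()
        })
    return weights


def filter_embeddings(tokens, weights, embeddings, threshold):
    """Scale each token vector by its weight (an assumption) and set
    vectors whose weight falls below `threshold` to zero."""
    dim = len(next(iter(embeddings.values())))
    out = []
    for tok in tokens:
        w = weights.get(tok, 0.0)
        vec = embeddings.get(tok, [0.0] * dim)  # unknown words map to zeros
        if w >= threshold:
            out.append([w * x for x in vec])
        else:
            out.append([0.0] * dim)
    return out


def segment(tokens, msl):
    """Split a token list into the fewest segments of length <= msl,
    with near-equal sizes, instead of truncating at msl.

    Hypothetical strategy; the paper's adaptive rules may differ.
    """
    if msl <= 0:
        raise ValueError("msl must be positive")
    if len(tokens) <= msl:
        return [tokens]
    n_seg = math.ceil(len(tokens) / msl)
    base, extra = divmod(len(tokens), n_seg)
    segments, start = [], 0
    for i in range(n_seg):
        size = base + (1 if i < extra else 0)
        segments.append(tokens[start:start + size])
        start += size
    return segments
```

Under these assumptions, a 10-token text with an MSL of 4 is split into three segments of lengths 4, 3, and 3, so no tokens are discarded. A word that appears in every document receives a weight of zero and is therefore zeroed by any positive threshold.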
