文字2vec
计算机科学
词(群论)
人工智能
文字嵌入
嵌入
文本分割
分割
序列(生物学)
集合(抽象数据类型)
模式识别(心理学)
滤波器(信号处理)
语音识别
数学
生物
遗传学
程序设计语言
计算机视觉
几何学
作者
Guoying Sun,Yanan Cheng,Zhaoxin Zhang,Xiaojun Tong,Tingting Chai
标识
DOI:10.1016/j.eswa.2023.121852
摘要
Text classification first needs to convert the text into embedding vectors. Considering that static word embedding models such as Word2vec do not consider the position information of word and the difference of its role in different documents, while dynamic word embedding models such as Bert consume a large amount of time. An improved word embedding model based on pre-trained Word2vec is proposed, which achieves better classification accuracy and much lower classification time than Bert. At first, the concept of Term Document Frequency (TDF) is proposed on the basis of TF-IDF, and the TF-IDF-TDF of each word in different documents is calculated. Then, The positional encoding is added. Finally, in order to reduce the misleading of words with low importance, a filter is designed to set the embedding vector with low importance to zero. Considering that the sequence length that the deep learning model can handle is limited, and the text sequence exceeding the Maximum Sequence Length (MSL) set by the deep learning model will be directly truncated and discarded, an adaptive segmentation model is proposed, which can set different segmentation strategies for different texts according to the length of the text and the MSL. In order to maintain the continuity of adjacent text after segmentation, an adjacent-segment-vector-attended co-attention network is designed. In addition, the multi-channel convolution and the capsule network are designed to further extract deep hidden features. Multiple comparative experiment results show that the proposed model achieves the best Accuracy and Micro-F1 on five long text baseline datasets and six short text baseline datasets. In addition, when the MSL is not set too large compared with the document length in the dataset, the classification results of the proposed model are not affected by it.
科研通智能强力驱动
Strongly Powered by AbleSci AI