Word2vec
Computer science
Latent Dirichlet allocation
Topic model
Word embedding
tf–idf
Sentiment analysis
Word (group theory)
Non-negative matrix factorization
Vectorization (mathematics)
Information retrieval
Artificial intelligence
Natural language processing
Term (time)
Data mining
Matrix decomposition
Embedding
Mathematics
Quantum mechanics
Geometry
Physics
Eigenvector
Parallel computing
Authors
Nuraisa Novia Hidayati,Putri Damayanti,Agus Zainal Arifin
Abstract
Tweet data from several official Twitter accounts of news portals can provide near-real-time traffic information, which helps in managing smooth mobility. However, this data is mixed with news on current issues, such as government policies and the pandemic situation. A news-grouping step is therefore needed: word vectors obtained through word embedding are fed into topic modeling to help separate traffic news from other news. We compared two methods that have been well tested on Twitter data across various categories: Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). Previous research indicates that with both methods the words that compose a topic remain quite difficult to interpret. Therefore, we use Word2vec as input and compare it against the very commonly used term frequency-inverse document frequency (TF-IDF), expecting that Word2vec groups related words and, in turn, yields a better division of topics. This study shows that LDA combined with Word2vec word vectorization achieves a coherence value of 0.56, compared with 0.57 for TF-IDF. For NMF, however, applying Word2vec gives better results than TF-IDF: TF-IDF reaches a coherence value of only 0.49, while Word2vec reaches 0.52. Furthermore, with NMF the Word2vec model successfully recognizes words that denote locations. Once the traffic news has been separated, we apply Named Entity Recognition (NER) to detect the location of an incident. We labeled the locations in 30% of the grouped tweet data as training data, and this method successfully detected locations when tested on other data.
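As an illustration of the coherence-based comparison described in the abstract, the following is a minimal sketch (not the authors' code) that fits an LDA topic model on a toy tokenized tweet corpus with gensim and scores it with the c_v coherence measure. The corpus, topic count, and parameters are assumptions for demonstration only; the Word2vec- and TF-IDF-based vectorization steps and the NMF/NER stages are not shown.

# Minimal sketch: LDA topic modeling plus c_v coherence scoring with gensim.
# The tiny tokenized "tweets" below are hypothetical examples, not the paper's data.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

tweets = [
    ["traffic", "jam", "toll", "road", "accident"],
    ["government", "policy", "pandemic", "vaccine"],
    ["road", "closed", "flood", "traffic", "detour"],
    ["covid", "cases", "policy", "hospital"],
]

# Build the vocabulary and a bag-of-words corpus.
dictionary = Dictionary(tweets)
bow_corpus = [dictionary.doc2bow(doc) for doc in tweets]

# Fit LDA; num_topics=2 and passes=10 are illustrative choices.
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
               random_state=0, passes=10)

# Score the learned topics with the c_v coherence measure,
# the kind of coherence value reported in the abstract (e.g., 0.56 vs 0.57).
coherence = CoherenceModel(model=lda, texts=tweets,
                           dictionary=dictionary,
                           coherence="c_v").get_coherence()
print("topics:", lda.print_topics())
print("c_v coherence:", round(coherence, 2))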