Paragraph
Computer science
Artificial intelligence
Natural language processing
Feature (linguistics)
Bag-of-words model
Semantics (computer science)
Popularity
Feature vector
Support vector machine
Linguistics
Psychology
Social psychology
World Wide Web
Philosophy
Programming language
Authors
Quoc V. Le, Tomáš Mikolov
Source
Venue: International Conference on Machine Learning
Date: 2014-06-21
Volume/Pages: 4: 1188-1196
Citations: 4522
Abstract
Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong," and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
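The abstract's core idea, a dense per-document vector trained jointly with word vectors to predict words in the document, can be sketched in a toy form. The sketch below is not the authors' implementation; it is a minimal, assumption-laden illustration in the spirit of the PV-DM variant: the prediction context is the average of the paragraph vector and surrounding word vectors, and a full softmax is used instead of the hierarchical softmax or negative sampling a real system would need. All function and variable names here are hypothetical.

```python
import numpy as np

def train_paragraph_vectors(docs, dim=8, window=2, lr=0.1, epochs=200, seed=0):
    """Toy Paragraph Vector trainer (PV-DM-style sketch, not the paper's code).

    Each document gets a dense vector D[d] that is trained, together with
    word vectors W, to predict each word from the average of D[d] and the
    context word vectors, via a full-softmax classifier U.
    """
    rng = np.random.default_rng(seed)
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    D = rng.normal(scale=0.1, size=(len(docs), dim))  # paragraph vectors
    W = rng.normal(scale=0.1, size=(V, dim))          # input word vectors
    U = rng.normal(scale=0.1, size=(V, dim))          # softmax output weights
    for _ in range(epochs):
        for d, doc in enumerate(docs):
            ids = [idx[w] for w in doc]
            for t, target in enumerate(ids):
                ctx = ids[max(0, t - window):t] + ids[t + 1:t + 1 + window]
                n = 1 + len(ctx)
                # Hidden state: average of paragraph vector and context words.
                h = (D[d] + W[ctx].sum(axis=0)) / n
                scores = U @ h
                p = np.exp(scores - scores.max())
                p /= p.sum()
                p[target] -= 1.0              # softmax cross-entropy gradient
                grad_h = U.T @ p
                U -= lr * np.outer(p, h)
                D[d] -= lr * grad_h / n       # only this doc's vector moves
                for c in ctx:
                    W[c] -= lr * grad_h / n
    return D, vocab

# Hypothetical toy corpus: three tiny "documents".
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["stocks", "fell", "today"]]
D, vocab = train_paragraph_vectors(docs)
```

The resulting rows of `D` are the fixed-length document representations the abstract refers to; in a downstream task they would be fed to a classifier (e.g., a support vector machine, one of the tags above). At inference time the paper holds word vectors fixed and gradient-descends a fresh paragraph vector for the unseen document, a step this sketch omits.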