Text classification method based on self-training and LDA topic models

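Computer science; Artificial intelligence; Set (abstract data type); Pattern recognition (psychology); Support vector machine; Representation (politics); Training set; Labeled data; Focus (optics); Machine learning; Document classification; Supervised learning; One-class classification; Artificial neural network; Physics; Optics; Politics; Programming language; Law; Political science
Authors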
Miha Pavlinek,Vili Podgorelec
Source
Journal: Expert Systems With Applications [Elsevier BV]
Volume: 80, pages 83-93; cited by: 156
Identifier
DOI:10.1016/j.eswa.2017.03.020
Abstract

Supervised text classification methods are efficient when they can learn from reasonably sized labeled sets. When only a small set of labeled documents is available, however, semi-supervised methods become more appropriate. These methods are based on comparing distributions between labeled and unlabeled instances; therefore, it is important to focus on the representation and its discrimination abilities. In this paper we present the ST LDA method for semi-supervised text classification with representations based on topic models. The proposed method comprises a semi-supervised text classification algorithm based on self-training and a model that determines parameter settings for any new document collection. Self-training is used to enlarge the small initial labeled set with the help of information from unlabeled data. We investigate how topic-based representation affects prediction accuracy by running the multinomial naive Bayes (NBMN) and support vector machine (SVM) classification algorithms on an enlarged labeled set, and then compare the results with the same method on a typical TF-IDF representation. We also compare ST LDA with supervised classification methods and other well-known semi-supervised methods. Experiments were conducted on 11 very small initial labeled sets sampled from six publicly available document collections. The results show that our ST LDA method, when used in combination with NBMN, performed significantly better in terms of classification accuracy than other comparable methods and variations. The ST LDA method thus proved to be a competitive classification method for different text collections when only a small set of labeled instances is available. As such, the proposed ST LDA method may well help to improve text classification tasks, which are essential in many advanced expert and intelligent systems, especially when labeled texts are scarce.
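The self-training step described in the abstract, in which a small labeled set is enlarged with confidently classified unlabeled documents, can be illustrated with a minimal sketch. This is not the authors' ST LDA implementation: it uses raw word tokens instead of an LDA topic representation, a hand-written multinomial naive Bayes classifier, and a fixed confidence threshold. The function names (`train_nb`, `predict_proba`, `self_train`), the threshold value, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(docs, labels, vocab, alpha=1.0):
    """Fit multinomial naive Bayes with Laplace smoothing (alpha)."""
    log_prior, log_like = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def predict_proba(model, doc):
    """Return {class: posterior probability} for one tokenized document."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in log_prior
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    norm = sum(exp.values())
    return {c: v / norm for c, v in exp.items()}


def self_train(labeled, unlabeled, threshold=0.7, max_iter=10):
    """Enlarge `labeled` with unlabeled docs predicted above `threshold`."""
    labeled, pool = list(labeled), list(unlabeled)
    vocab = {w for d, _ in labeled for w in d} | {w for d in pool for w in d}

    def fit():
        return train_nb([d for d, _ in labeled], [y for _, y in labeled], vocab)

    for _ in range(max_iter):
        if not pool:
            break
        model = fit()
        confident, rest = [], []
        for doc in pool:
            proba = predict_proba(model, doc)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                confident.append((doc, label))
            else:
                rest.append(doc)
        if not confident:  # nothing new to add, so stop early
            break
        labeled.extend(confident)
        pool = rest
    return fit(), labeled  # final model is trained on the enlarged set


# Toy data (hypothetical): two labeled documents, four unlabeled ones.
labeled = [
    ("goal match team".split(), "sport"),
    ("election vote party".split(), "politics"),
]
unlabeled = [
    "team goal score".split(),
    "party vote law".split(),
    "match score win".split(),
    "law senate vote".split(),
]

model, enlarged = self_train(labeled, unlabeled, threshold=0.7)
proba = predict_proba(model, "goal win".split())
predicted = max(proba, key=proba.get)
print(len(enlarged), predicted)
```

In this toy run, two of the unlabeled documents fall below the threshold in the first pass and are only labeled in the second pass, after the first pseudo-labeled documents have been added to the training set. In a topic-based variant, each document would presumably be represented by its inferred topic distribution rather than by word tokens; that substitution is not shown here.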
