
Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML

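Random forest, Support vector machine, Drug discovery, Computer science, Artificial intelligence, Drug development, Machine learning, Data mining, Computational biology, Pharmaceuticals, Bioinformatics, Biology, Pharmacology
Authors
Ayush Garg, Narayanan Ramamurthi, Shyam Sundar Das
Source
Journal: Journal of Chemical Information and Modeling [American Chemical Society]
Volume (Issue): 65 (8): 3976-3989 Citations: 1
Identifier
DOI: 10.1021/acs.jcim.5c00023
Abstract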

Classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, so the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models; they can be categorized as data-level, algorithm-level, and hybrid methods. However, to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We address this gap with a detailed analysis of the performance of four different techniques for handling imbalanced class distributions using machine learning (ML) methods and AutoML tools. The four techniques are (a) threshold optimization using (i) GHOST and (ii) the area under the precision-recall curve (AUPR), (b) the internal balancing method of AutoML and the class-weight option of ML methods, and (c) data balancing using SMOTETomek. We generated 27 data sets covering nine different class ratios (i.e., the ratio of the positive class to total samples) from three data sets in the drug discovery and development field. We employed random forest (RF) and support vector machine (SVM) as representative ML classifiers, and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representative AutoML tools.
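To make the idea of threshold optimization concrete, here is a minimal sketch in plain Python. It is not GHOST and not necessarily the authors' precision-recall procedure, which the abstract does not spell out. It assumes one common reading: scan candidate decision thresholds and keep the one with the best F1 score. The function names and the toy labels and scores are hypothetical.

```python
def class_ratio(labels):
    """Ratio of positive samples to total samples, as defined in the abstract."""
    return sum(labels) / len(labels)


def f1_at_threshold(labels, scores, threshold):
    """F1 score when samples with score >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, scores):
    """Try every distinct score as a threshold and return the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(labels, scores, t))


# Hypothetical toy data: 2 positives out of 10 samples (class ratio 0.2).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.25, 0.30, 0.45, 0.35, 0.40]

default_f1 = f1_at_threshold(labels, scores, 0.5)   # conventional 0.5 cutoff
threshold = best_threshold(labels, scores)          # tuned cutoff
tuned_f1 = f1_at_threshold(labels, scores, threshold)
print(class_ratio(labels), default_f1, threshold, tuned_f1)
```

In this toy case the 0.5 cutoff predicts no positives at all, while a lower cutoff recovers both. Moving the threshold changes only the predicted labels, not the order of the scores. In practice the threshold should be chosen on training or validation data rather than on the test set.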
The important findings of our study are as follows: (i) threshold optimization has no effect on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class-weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375, 33.33, and 450 can be achieved across all the data sets for F1 score, MCC, and balanced accuracy, respectively, which are suitable metrics for evaluating performance on imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33, 37.25, and 533.33 can be achieved across all the data sets for F1 score, MCC, and balanced accuracy, respectively; (iv) the percentage improvement in balanced accuracy generally increases as the class ratio is systematically decreased from 0.5 to 0.1, whereas for F1 score and MCC the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML tools; (vii) AutoML tools perform as well as the ML models, and in some cases even better, on imbalanced classification when applied with imbalance-handling techniques. In summary, exploring multiple data balancing techniques is recommended when classifying imbalanced data sets to achieve optimal performance, as neither the external nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study; for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
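The three label-based metrics named in the findings can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions of F1 score, MCC, and balanced accuracy. The `percentage_improvement` helper assumes the improvement is measured relative to the unbalanced baseline, a definition the abstract does not state. The predictions are hypothetical toy data, not results from the paper.

```python
import math


def confusion_counts(labels, predictions):
    """Return (tp, tn, fp, fn) for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2


def percentage_improvement(baseline, improved):
    """Assumed definition: relative change over the baseline, in percent."""
    return 100 * (improved - baseline) / baseline


# Hypothetical toy predictions on 10 samples with 2 positives.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
baseline_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # misses one positive
balanced_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # finds both, one false alarm

base = confusion_counts(labels, baseline_pred)
bal = confusion_counts(labels, balanced_pred)
for name, metric in [("F1", f1_score), ("MCC", mcc), ("BA", balanced_accuracy)]:
    before, after = metric(*base), metric(*bal)
    print(name, round(before, 4), round(after, 4),
          round(percentage_improvement(before, after), 2))
```

All three metrics depend on the predicted labels, so they change when the decision threshold moves. AUC and AUPR depend only on the ordering of the scores, which is consistent with finding (i).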