
Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML

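Random forest, Support vector machine, Drug discovery, Computer science, Artificial intelligence, Drug development, Machine learning, Data mining, Computational biology, Drugs, Bioinformatics, Biology, Pharmacology
Authors
Ayush Garg, Narayanan Ramamurthi, Shyam Sundar Das
Source
Journal: Journal of Chemical Information and Modeling [American Chemical Society]
Identifier
DOI: 10.1021/acs.jcim.5c00023
Abstract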

Classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, so the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models, and they can be categorized as data-level, algorithm-level, and hybrid methods. To the best of our knowledge, however, an in-depth analysis of how these techniques perform as the class ratio varies is not available in the literature. We address this gap with a detailed analysis of four techniques for handling imbalanced class distributions using machine learning (ML) methods and AutoML tools. The four techniques are: (a) threshold optimization using (i) GHOST and (ii) the precision-recall curve (the area under which is the AUPR), (b) the internal balancing methods of the AutoML tools and the class-weight option of the ML methods, and (c) data balancing using SMOTETomek. We generated 27 data sets covering nine different class ratios (i.e., the ratio of the positive class to total samples) from three data sets that belong to the drug discovery and development field. We employed random forest (RF) and support vector machine (SVM) as representative ML classifiers, and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representative AutoML tools.
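To make the idea of threshold optimization concrete, here is a minimal sketch that sweeps candidate decision thresholds over predicted scores and keeps the one with the highest F1 score. It is not GHOST and not necessarily the authors' precision-recall procedure; the function names, labels, and scores are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def f1_at_threshold(y_true: List[int], scores: List[float], threshold: float) -> float:
    """F1 of the positive class when scores >= threshold are labeled positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true: List[int], scores: List[float]) -> Tuple[float, float]:
    """Try every distinct score as a threshold; keep the one with the highest F1."""
    best_t, best_f1 = 0.5, f1_at_threshold(y_true, scores, 0.5)
    for t in sorted(set(scores)):
        f1 = f1_at_threshold(y_true, scores, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical toy data: 2 positives out of 10 samples (class ratio 0.2).
# In practice the threshold would be tuned on training or validation data,
# never on the test set.
Y_TRUE = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
SCORES = [0.05, 0.10, 0.12, 0.20, 0.22, 0.25, 0.30, 0.45, 0.35, 0.40]

default_f1 = f1_at_threshold(Y_TRUE, SCORES, 0.5)
tuned_t, tuned_f1 = best_f1_threshold(Y_TRUE, SCORES)
print(f"F1 at 0.5: {default_f1:.2f}; F1 at tuned threshold {tuned_t}: {tuned_f1:.2f}")
```

On this toy data no score reaches the default threshold of 0.5, so every positive is missed; lowering the threshold recovers them at the cost of one false positive.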
The important findings of our study are as follows: (i) threshold optimization has no effect on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class-weighting and SMOTETomek; (ii) for the ML methods RF and SVM, percentage improvements of up to 375, 33.33, and 450 can be achieved across the data sets in F1 score, MCC, and balanced accuracy, respectively, which are metrics suitable for evaluating performance on imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, percentage improvements of up to 383.33, 37.25, and 533.33 can be achieved across the data sets in F1 score, MCC, and balanced accuracy, respectively; (iv) the percentage improvement in balanced accuracy generally increases as the class ratio is systematically decreased from 0.5 to 0.1, whereas for F1 score and MCC the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML tools; (vii) AutoML tools perform as well as the ML models, and in some cases better, on imbalanced classification when combined with imbalance-handling techniques. In summary, exploring multiple data-balancing techniques is recommended when classifying imbalanced data sets to achieve optimal performance, because neither the external nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study; to generalize them, a study could be carried out with a sizable number of ML methods and AutoML libraries.
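The metrics named in these findings can be illustrated with a small sketch. It computes F1 score, MCC, and balanced accuracy at two decision thresholds, and a pairwise AUC that takes no threshold at all, which is why threshold optimization cannot change a ranking metric. The percentage-improvement formula, 100 × (improved − baseline) / baseline, is an assumption, since the abstract does not define it; the data and function names are hypothetical.

```python
import math
from typing import Dict, List


def confusion(y_true: List[int], scores: List[float], threshold: float) -> Dict[str, int]:
    """Confusion counts when scores >= threshold are labeled positive."""
    c = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        key = ("t" if pred == y else "f") + ("p" if pred == 1 else "n")
        c[key] += 1
    return c


def f1(c: Dict[str, int]) -> float:
    denom = 2 * c["tp"] + c["fp"] + c["fn"]
    return 2 * c["tp"] / denom if denom else 0.0


def mcc(c: Dict[str, int]) -> float:
    """Matthews correlation coefficient; 0.0 when any marginal is empty."""
    tp, fp, tn, fn = c["tp"], c["fp"], c["tn"], c["fn"]
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balanced_accuracy(c: Dict[str, int]) -> float:
    """Mean of sensitivity and specificity (assumes both classes are present)."""
    return 0.5 * (c["tp"] / (c["tp"] + c["fn"]) + c["tn"] / (c["tn"] + c["fp"]))


def pairwise_auc(y_true: List[int], scores: List[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def percent_improvement(baseline: float, improved: float) -> float:
    """Assumed definition: relative gain over a nonzero baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Hypothetical toy data: 2 positives out of 10 samples.
Y_TRUE = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
SCORES = [0.05, 0.10, 0.12, 0.20, 0.22, 0.25, 0.30, 0.45, 0.35, 0.60]

base = confusion(Y_TRUE, SCORES, 0.5)    # default threshold
tuned = confusion(Y_TRUE, SCORES, 0.35)  # lowered threshold

f1_gain = percent_improvement(f1(base), f1(tuned))
mcc_gain = percent_improvement(mcc(base), mcc(tuned))
ba_gain = percent_improvement(balanced_accuracy(base), balanced_accuracy(tuned))
auc = pairwise_auc(Y_TRUE, SCORES)  # depends on the scores only, not on a threshold
print(f"F1 +{f1_gain:.2f}%, MCC +{mcc_gain:.2f}%, balanced accuracy +{ba_gain:.2f}%, AUC {auc:.4f}")
```

Because the pairwise AUC is computed from the scores alone, it is identical before and after the threshold change. Techniques that retrain the model, such as class weighting or resampling, change the scores themselves and can therefore move it.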