Computer science
Sampling (signal processing)
Stratified sampling
Oversampling
Artificial intelligence
Machine learning
Classifier (UML)
Class (philosophy)
Selection (genetic algorithm)
Simple random sampling
Data mining
Statistics
Mathematics
Bandwidth (computing)
Computer network
Filter (signal processing)
Computer vision
Population
Demography
Sociology
Authors
Cian Lin, Chih-Fong Tsai, Wei-Chiang Lin
Identifier
DOI:10.1007/s10462-022-10186-5
Abstract
The skewed class distributions of many class-imbalanced domain datasets often make it difficult for machine learning techniques to construct effective models. In such cases, data re-sampling techniques, such as under-sampling the majority class and over-sampling the minority class, are usually employed. In the related literature, some studies have shown that hybrid combinations of under- and over-sampling methods applied in different orders can produce better results. However, each of these studies compares the hybrid only against either under- or over-sampling methods before drawing its conclusions. The research objective of this paper is therefore to find out which order of combining under- and over-sampling methods performs better. Experiments are conducted on 44 different domain datasets using three over-sampling algorithms, including SMOTE, CTGAN, and TAN, and three under-sampling (i.e., instance selection) algorithms, including IB3, DROP3, and GA. The results show that if the under-sampling algorithm is chosen carefully, i.e., IB3, no significant performance improvement is obtained by further adding the over-sampling step. Furthermore, with the IB3 algorithm, it is better to perform instance selection first and over-sampling second than the reverse order, which allows the random forest classifier to provide the highest AUC rate.
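The "instance selection first, over-sampling second" pipeline the abstract favours can be sketched in a few lines. This is only an illustration, not the paper's implementation: a greedy condensed-nearest-neighbour rule stands in for IB3, and a simple nearest-neighbour interpolation stands in for SMOTE; the toy dataset, parameters, and function names are all assumptions for the example.

```python
import numpy as np

def condensed_selection(X, y, rng):
    """Greedy 1-NN instance selection (condensed nearest neighbour),
    a simplified stand-in for IB3: keep an instance only if the
    currently kept subset would misclassify it."""
    idx = rng.permutation(len(X))
    keep = [idx[0]]
    for i in idx[1:]:
        d = np.linalg.norm(X[keep] - X[i], axis=1)
        if y[keep[int(np.argmin(d))]] != y[i]:
            keep.append(i)
    return X[keep], y[keep]

def smote_like(X_min, n_new, k, rng):
    """SMOTE-style interpolation: synthesize minority points on the
    segment between a sample and one of its k nearest minority
    neighbours (plain duplication if only one minority point is left)."""
    if len(X_min) < 2:
        return np.repeat(X_min, n_new, axis=0)
    k_eff = min(k, len(X_min) - 1)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k_eff + 1]   # skip the point itself
        j = rng.choice(nbrs)
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)

rng = np.random.default_rng(0)
# toy imbalanced data: 40 majority (label 0), 8 minority (label 1)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (8, 2))])
y = np.array([0] * 40 + [1] * 8)

# step 1: under-sample (instance selection) first
Xs, ys = condensed_selection(X, y, rng)
# step 2: over-sample the remaining minority up to the majority count
X_min = Xs[ys == 1]
n_new = int((ys == 0).sum()) - len(X_min)
if n_new > 0:
    Xs = np.vstack([Xs, smote_like(X_min, n_new, k=3, rng=rng)])
    ys = np.concatenate([ys, np.ones(n_new, dtype=int)])
print((ys == 0).sum(), (ys == 1).sum())  # class counts after the pipeline
```

Swapping the two steps (SMOTE-style over-sampling first, selection second) gives the other combination order the paper compares; the resulting set would then feed a classifier such as a random forest, evaluated by AUC.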