Oversampling
Computer science
Subspace topology
Machine learning
Artificial intelligence
Support vector machine
Random forest
Classifier
Bayesian probability
Random subspace method
Pattern recognition
Feature
Reinforcement learning
Naive Bayes classifier
Linear subspace
Feature vector
Process (computing)
Data mining
Dirichlet distribution
Overfitting
Relevance
Nonlinear system
Relevance vector machine
Relation (database)
Empirical research
Genetic algorithm
Author
Mahesh Kumbhar, Sunith Bandaru, Alexander Karlsson
Identifier
DOI: 10.1007/s10462-025-11417-1
Abstract
Many real-world machine learning classification problems suffer from imbalanced training data, where the least frequent label is highly relevant to the end user, for example equipment breakdowns or various types of process anomalies. This imbalance can degrade the learning algorithm and lead to misclassification of minority labels, resulting in erroneous actions and potentially high unexpected costs. Most previous oversampling methods rely only on the minority samples, often ignoring their overall density and distribution relative to the other classes, and most offer little explainability. In contrast, this paper proposes a novel oversampling method that considers a subspace of the feature set when creating synthetic minority samples, using nonlinear optimization of a class-sensitive objective function. Suitable subspaces for oversampling are identified through a Bayesian reinforcement strategy based on Dirichlet smoothing, which may be useful for explainable AI. The proposed method is compared empirically with 10 existing techniques on 18 real-world datasets, using two traditional machine learning classifiers and four evaluation metrics. Statistical analysis of cross-validated runs over the 18 datasets and four metrics (i.e., 72 experiments) shows that the proposed approach ranks among the best-performing methods in 6 instances with the random forest classifier and in 2 instances with the support vector machine classifier, placing it at the top overall. The study also reveals that some feature combinations matter more than others for minority oversampling, and the proposed approach offers a way to identify such features.
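To make the two central ideas of the abstract concrete, the sketch below illustrates (a) Dirichlet smoothing used to turn per-subspace "success" counts into selection probabilities, and (b) SMOTE-style interpolation restricted to a chosen feature subspace. This is a minimal illustrative example, not the authors' algorithm: the function names, the reward counts, the candidate subspaces, and the interpolation rule are all assumptions made for demonstration.

```python
import random

rng = random.Random(42)

def smoothed_probs(counts, alpha=1.0):
    """Dirichlet-smoothed selection probabilities over candidate subspaces:
    p_i = (c_i + alpha) / (sum(c) + alpha * K)."""
    k = len(counts)
    total = sum(counts)
    return [(c + alpha) / (total + alpha * k) for c in counts]

def oversample_in_subspace(minority, subspace, n_new):
    """Create synthetic minority samples by interpolating between two random
    minority samples, but only along the features in `subspace`; the other
    features are copied unchanged from the base sample."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        lam = rng.random()
        child = list(a)
        for j in subspace:
            child[j] = a[j] + lam * (b[j] - a[j])
        synthetic.append(child)
    return synthetic

# Hypothetical past "success" counts for three candidate feature subspaces.
counts = [5, 0, 2]
probs = smoothed_probs(counts)   # with alpha=1: [0.6, 0.1, 0.3]

# Pick a subspace in proportion to its smoothed score, then oversample in it.
subspaces = [[0], [1], [0, 1]]
chosen = rng.choices(subspaces, weights=probs, k=1)[0]
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
new_samples = oversample_in_subspace(minority, chosen, n_new=3)
```

The smoothing constant alpha keeps every subspace selectable even before it has accumulated any reward (here subspace `[1]` with zero successes still gets probability 0.1), which is what makes a reinforcement-style exploration of subspaces possible; inspecting the final counts also hints at which feature combinations mattered most, echoing the explainability angle of the paper.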