Keywords
Undersampling, Posterior probability, Prior probability, Skewness, Computer science, Sampling (signal processing), Artificial intelligence, Bayesian probability, Probability distribution, Bayes' theorem, Conditional probability, Statistics, Naive Bayes classifier, Machine learning, Pattern recognition (psychology), Mathematics, Support vector machine, Filter (signal processing), Telecommunications, Computer vision
Authors
Andrea Dal Pozzolo, Olivier Caelen, R. A. Johnson, Gianluca Bontempi
Abstract
Undersampling is a popular technique for reducing the skew in the class distribution of unbalanced datasets. However, it is well known that undersampling one class modifies the priors of the training set and consequently biases the posterior probabilities of a classifier. In this paper, we study analytically and experimentally how undersampling affects the posterior probability of a machine learning model. We formalize the problem of undersampling and explore the relationship between the conditional probabilities in the presence and absence of undersampling. Although the bias due to undersampling does not affect the ranking order returned by the posterior probability, it significantly impacts classification accuracy and probability calibration. We use Bayes Minimum Risk theory to find the correct classification threshold and show how to adjust it after undersampling. Experiments on several real-world unbalanced datasets validate our results.
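The relationship the abstract refers to can be illustrated with a short sketch. Assuming undersampling keeps each majority-class (negative) example with probability β, the posterior ps learned on the undersampled data is a monotone distortion of the true posterior p, so rankings are preserved but calibration is not; inverting that distortion (or, equivalently, shifting the decision threshold) undoes the bias. The function names below are illustrative, not from the paper:

```python
def biased_posterior(p, beta):
    """Posterior a classifier tends to learn after undersampling,
    assuming each negative example is retained with probability beta
    (0 < beta <= 1). Monotone in p, so ranking order is unchanged."""
    return p / (p + beta * (1.0 - p))

def corrected_posterior(ps, beta):
    """Invert the undersampling bias to recover a calibrated posterior."""
    return beta * ps / (beta * ps - ps + 1.0)

def adjusted_threshold(tau, beta):
    """Threshold on the biased posterior that is equivalent to using
    threshold tau on the unbiased posterior (Bayes Minimum Risk gives
    tau = 0.5 for symmetric misclassification costs)."""
    return tau / (tau + beta * (1.0 - tau))

# Example: a 10% retention rate inflates a true posterior of 0.2
p_true, beta = 0.2, 0.1
ps = biased_posterior(p_true, beta)          # well above 0.5
p_back = corrected_posterior(ps, beta)       # recovers 0.2
tau_s = adjusted_threshold(0.5, beta)        # raised threshold on ps
```

Note that thresholding the biased posterior at `adjusted_threshold(0.5, beta)` yields the same decisions as thresholding the corrected posterior at 0.5, which is why either correcting the probabilities or moving the threshold fixes accuracy; only the correction additionally fixes calibration.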