Outlier
Centroid
Computer science
Randomness
Cluster analysis
k-nearest neighbors algorithm
Scalability
Algorithm
Data mining
Anomaly detection
Stability (learning theory)
Pattern recognition (psychology)
Artificial intelligence
Mathematics
Machine learning
Database
Statistics
Authors
Jiyong Liao, Xingjiao Wu, Yaxin Wu, Juelin Shu
Identifier
DOI:10.1016/j.knosys.2024.111742
Abstract
K-means is an unsupervised method for vector quantization derived from signal processing, now widely used in data mining and knowledge discovery. Its advantages include simple operation, scalability, and suitability for processing large-scale datasets. However, K-means randomly selects the initial cluster centers, which causes unstable clustering results, and outliers degrade its performance. To address these challenges, we propose a nearest-neighbor density peak (NNDP)-optimized initial cluster center and outlier removal algorithm. To solve the problem of randomly selecting the initial cluster centers, we propose NNDP-based K-means (K-NNDP). K-NNDP automatically selects the initial cluster centers based on decision values, ensuring stable algorithm operation. In addition, we adopt a local search strategy to eliminate outliers, identifying them with a set threshold, and use the median instead of the mean in subsequent centroid iterations to reduce the impact of outliers on the algorithm. Notably, most previous studies have addressed these two problems independently, which makes the algorithm prone to falling into a local optimum. We therefore combine the two problems using K-nearest neighbor modeling. To evaluate the effectiveness of K-NNDP, we conducted comparative experiments on several synthetic and real-world datasets. K-NNDP outperformed two classical algorithms and six state-of-the-art improved K-means algorithms. The results show that K-NNDP effectively mitigates both the randomness of initialization and the influence of outliers on K-means.
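The abstract does not give pseudocode, but the described pipeline can be sketched. The sketch below is a minimal illustration, not the authors' implementation: it assumes a k-NN-based density estimate, a density-peaks-style decision value (density times distance to the nearest higher-density point) for picking initial centers, a simple low-density-quantile threshold for outlier removal, and coordinate-wise medians in the centroid update. The helper names and the threshold choice are hypothetical.

```python
import numpy as np

def knn_density(X, k):
    """Density of each point: inverse mean distance to its k nearest neighbors."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn_d = np.sort(d, axis=1)[:, 1:k + 1]  # column 0 is the zero self-distance
    return 1.0 / (knn_d.mean(axis=1) + 1e-12), d

def select_centers(X, n_clusters, k=5):
    """Pick initial centers by decision value, as in density peaks clustering."""
    rho, d = knn_density(X, k)
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        # densest point overall gets the maximum distance as its delta
        delta[i] = d[i].max() if len(higher) == 0 else d[i, higher].min()
    decision = rho * delta
    return X[np.argsort(decision)[-n_clusters:]]

def knndp_sketch(X, n_clusters, k=5, n_iter=50, outlier_q=0.05):
    """Density-peak initialization + outlier removal + median-based iterations."""
    centers = select_centers(X, n_clusters, k)
    rho, _ = knn_density(X, k)
    # treat the lowest-density fraction as outliers (quantile cut is an assumption)
    inliers = X[rho >= np.quantile(rho, outlier_q)]
    for _ in range(n_iter):
        dists = np.linalg.norm(inliers[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # median instead of mean, to blunt the influence of remaining outliers
        new = np.array([np.median(inliers[labels == j], axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(n_clusters)])
        if np.allclose(new, centers):
            break
        centers = new
    labels_all = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    return centers, labels_all
```

Because the initial centers come from decision values rather than random draws, repeated runs on the same data produce the same result, which is the stability property the abstract emphasizes.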