Authors
Dongdong Cheng,Xiaocui Jiang,Shuyin Xia,Guoyin Wang
Abstract
With the rapid advancement of information technology, vast amounts of high-dimensional data have accumulated across many domains. Clustering such data is challenging because existing methods often run slowly and lose accuracy as dimensionality grows. To address these issues, we adopt the granular-ball approach, which reduces the number of sample points to speed up processing, and we combine it with feature selection to improve clustering accuracy. Granular-ball computing, a coarse-grained data representation technique, has been shown in recent studies to improve classification and clustering models. However, current granular-ball division techniques are inadequate for high-dimensional data. To handle the difficulties of clustering high-dimensional data and to improve on existing granular-ball methods, this paper proposes a novel granular-ball division approach that leverages pseudo-labels and feature selection. The improved division process identifies anchor points, on which we build a fast spectral clustering algorithm for high-dimensional data, termed PLGB-FSC. Specifically, we first apply weighted K-Means to the features to generate pseudo-labels. We then perform a first stage of feature selection using the mutual information between the pseudo-labels and the features, which removes the interference of irrelevant features. We further refine the selection by combining the standard deviation and Pearson correlation coefficients to choose mutually independent features. Using the pseudo-labels, we then perform granular-ball division to obtain anchor points. Lastly, we construct a similarity matrix between all sample points and the anchor points and apply spectral clustering to obtain the final clustering result.
Experimental evaluations show that PLGB-FSC outperforms state-of-the-art algorithms such as W-KMeans, WGB, GB-USC, RC-PCA-SC, GLUFC, FGOC, SFESA, SPCAFS, and LLSRFS, achieving both higher accuracy and faster execution. The source code is available at https://github.com/DongdongCheng/PLGB-FSC.
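The pipeline summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository above for that), and it rests on several assumptions: plain k-means with farthest-first initialization stands in for weighted K-Means, per-class k-means centroids stand in for granular-ball centers, the standard-deviation and Pearson refinement step is omitted, and the Gaussian kernel with a median bandwidth is one common choice rather than the paper's. All function names (`kmeans`, `mutual_info`, `anchor_spectral`, `pipeline_sketch`) and parameters are hypothetical.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd k-means with farthest-first initialization; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Next seed: the point farthest from all seeds chosen so far.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = sq_dists(X, centers).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return sq_dists(X, centers).argmin(axis=1), centers

def mutual_info(labels, feature, bins=8):
    """Mutual information (nats) between labels and an equal-width-binned feature."""
    edges = np.histogram_bin_edges(feature, bins=bins)
    f = np.digitize(feature, edges[1:-1])  # bin index in 0..bins-1
    joint = np.zeros((labels.max() + 1, bins))
    np.add.at(joint, (labels, f), 1.0)
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / outer[nz])).sum())

def anchor_spectral(X, anchors, k):
    """Spectral clustering on a sample-to-anchor similarity matrix (n x p)."""
    d2 = sq_dists(X, anchors)
    sigma2 = np.median(d2)                    # assumed bandwidth heuristic
    Z = np.exp(-d2 / (2.0 * sigma2))
    Z /= Z.sum(axis=1, keepdims=True)         # row-normalize similarities
    Zhat = Z / np.sqrt(Z.sum(axis=0, keepdims=True))
    # Left singular vectors of Zhat are eigenvectors of the n x n affinity
    # Zhat @ Zhat.T, obtained without ever forming that matrix.
    U, _, _ = np.linalg.svd(Zhat, full_matrices=False)
    emb = U[:, :k]
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return kmeans(emb, k)[0]

def pipeline_sketch(X, k, n_keep, anchors_per_class=3):
    """Pseudo-labels -> MI feature selection -> anchors -> spectral clustering."""
    pseudo, _ = kmeans(X, k)                  # stand-in for weighted K-Means
    mi = np.array([mutual_info(pseudo, X[:, j]) for j in range(X.shape[1])])
    keep = np.sort(np.argsort(mi)[-n_keep:])  # features most informative about pseudo-labels
    Xs = X[:, keep]
    anchors = []
    for c in np.unique(pseudo):               # stand-in for granular-ball division
        Xc = Xs[pseudo == c]
        anchors.append(kmeans(Xc, min(anchors_per_class, len(Xc)))[1])
    return keep, anchor_spectral(Xs, np.vstack(anchors), k)
```

Because the similarity matrix is n x p with p anchors rather than n x n, the decomposition cost scales with the number of anchors, which is the source of the speed-up that anchor-based spectral clustering aims for. A call such as `pipeline_sketch(X, k=2, n_keep=2)` returns the indices of the retained features together with the final cluster labels.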