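Cluster analysis
Computer science
Constrained clustering
Data mining
Metric (unit)
Correlation clustering
Data stream clustering
CURE data clustering algorithm
Categorical variable
Clustering high-dimensional data
Domain (mathematical analysis)
Consensus clustering
Fuzzy clustering
Process (computing)
Machine learning
Mathematics
Mathematical analysis
Economy
Operating system
Operations management
Authors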
Olga Andreeva, Wei Li, Wei Ding, Marieke L. Kuijjer, John Quackenbush, Ping Chen
Identifier
DOI:10.1145/3394486.3403187
Abstract
Clustering is an important unsupervised learning method with serious challenges when data is sparse and high-dimensional. Generated clusters are often evaluated with general measures, which may not be meaningful or useful for practical applications and domains. Using a distance metric, a clustering algorithm searches through the data space, groups close items into one cluster, and assigns far away samples to different clusters. In many real-world applications, the number of dimensions is high and data space becomes very sparse. Selection of a suitable distance metric is very difficult and becomes even harder when categorical data is involved. Moreover, existing distance metrics are mostly generic, and clusters created based on them will not necessarily make sense to domain-specific applications. One option to address these challenges is to integrate domain-defined rules and guidelines into the clustering process. In this work we propose a GAN-based approach called Catalysis Clustering to incorporate domain knowledge into the clustering process. With GANs we generate catalysts, which are special synthetic points drawn from the original data distribution and verified to improve clustering quality when measured by a domain-specific metric. We then perform clustering analysis using both catalysts and real data. Final clusters are produced after catalyst points are removed. Experiments on two challenging real-world datasets clearly show that our approach is effective and can generate clusters that are meaningful and useful for real-world applications.
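The workflow the abstract describes (propose synthetic points, keep those that improve a domain-specific score, cluster real data together with the kept points, then drop the synthetic points) can be illustrated with a minimal sketch. This is not the authors' implementation. Several pieces are assumptions made only for illustration: a Gaussian jitter of real points stands in for the GAN generator, a plain k-means stands in for the clustering algorithm, and `domain_score` is a hypothetical purity measure against domain-provided group labels. All function names are invented for this example.

```python
import random
from collections import Counter


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def domain_score(labels, groups):
    """Hypothetical domain metric: cluster purity against domain group labels."""
    total = 0
    for c in set(labels):
        member_groups = [g for l, g in zip(labels, groups) if l == c]
        total += Counter(member_groups).most_common(1)[0][1]
    return total / len(labels)


def propose_candidate(points, rng, noise=0.5):
    """Stand-in for a GAN generator (assumption): jitter a random real point."""
    return [x + rng.gauss(0, noise) for x in rng.choice(points)]


def catalysis_clustering(points, groups, k, n_candidates=20, seed=0):
    """Keep candidates that raise the domain score; return labels for real points only."""
    rng = random.Random(seed)
    baseline = domain_score(kmeans(points, k), groups)
    best, catalysts = baseline, []
    for _ in range(n_candidates):
        trial = catalysts + [propose_candidate(points, rng)]
        # Cluster real and synthetic points together, then score real points only.
        labels = kmeans(points + trial, k)[: len(points)]
        score = domain_score(labels, groups)
        if score > best:  # the candidate is verified to help, so keep it
            best, catalysts = score, trial
    # Final clustering with all kept catalysts; catalyst labels are discarded.
    final = kmeans(points + catalysts, k)[: len(points)]
    return final, catalysts, baseline, best
```

Calling `catalysis_clustering(points, groups, k=2)` on a list of numeric feature vectors returns labels for the real points only, the accepted synthetic points, and the domain score before and after. The greedy accept-if-better loop is just one simple way to realize the "verified to improve clustering quality" step; the abstract does not specify how verification is done.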