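Computer science
Representation (politics)
Autoencoder
Cluster analysis
Feature learning
Field (mathematics)
Identification (biology)
Artificial intelligence
Clustering high-dimensional data
External Data Representation
Data mining
Machine learning
Deep learning
Mathematics
Biology
Politics
Botany
Law
Pure mathematics
Political science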
Authors
G. Viaud, P. Mayilvahanan, Paul-Henry Cournède
Identifier
DOI:10.1109/tcbb.2021.3060340
Abstract
The integration of several sources of data for the identification of subtypes of diseases has gained attention over the past few years. The heterogeneity and the high dimensionality of the data sets call for an adequate representation of the data. We summarize the field of representation learning for the multi-omics clustering problem and we investigate several techniques to learn relevant combined representations, using methods from group factor analysis (PCA, MFA and extensions) and from machine learning with autoencoders. We highlight the importance of appropriately designing and training the latter, notably with a novel combination of a disjointed deep autoencoder (DDAE) architecture and a layer-wise reconstruction loss. These different representations can then be clustered to identify biologically meaningful clusters of patients. We provide a unifying framework for model comparison between statistical and deep learning approaches with the introduction of a new weighted internal clustering index that evaluates how well the clustering information is retained from each source, favoring contributions from all data sets. We apply our methodology to two case studies for which previous work on integrative clustering exists, TCGA Breast Cancer and TARGET Neuroblastoma, and show how our method can yield good and well-balanced clusters across the different data sources.
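The abstract does not define the weighted internal clustering index, so the following is only an illustrative sketch of the general idea, not the authors' formula. It assumes the silhouette coefficient as the per-source internal index and a weighted mean as the combination rule; the function names (`silhouette`, `weighted_index`), the source names, and the toy data are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette(points, labels):
    """Mean silhouette coefficient of one labelling on one data source."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def weighted_index(sources, labels, weights=None):
    """Weighted mean of per-source silhouettes (assumed combination rule).

    sources: dict mapping a source name to a list of per-patient vectors.
    labels:  one cluster label per patient, shared across all sources.
    """
    names = list(sources)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights by default
    total = sum(weights[name] for name in names)
    per_source = {name: silhouette(sources[name], labels) for name in names}
    combined = sum(weights[n] * per_source[n] for n in names) / total
    return combined, per_source


# Toy data: four patients measured in two hypothetical sources.
sources = {
    "expression": [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]],
    "methylation": [[0.0], [5.0], [0.5], [5.5]],
}
labels = [0, 0, 1, 1]

combined, per_source = weighted_index(sources, labels)
print(per_source)
print(combined)
```

In this toy case the labelling separates the patients well in the first source but cuts across the structure of the second, so the combined score is pulled down. That is the behaviour the abstract describes as favoring contributions from all data sets: a clustering driven by a single source is penalized.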