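Feature selection
Count data
Mixture model
Lasso (programming language)
Cluster analysis
Computer science
Negative binomial distribution
Model selection
Bayesian information criterion
Bayesian inference
Bayesian probability
Inference
Feature (linguistics)
Data mining
Pattern recognition (psychology)
Artificial intelligence
Mathematics
Statistics
Poisson distribution
Philosophy
World Wide Web
Linguistics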
Authors
Yujia Li,Tanbin Rahman,Tianzhou Ma,Lu Tang,George C. Tseng
Source
Journal: Biostatistics
[Oxford University Press]
Date: 2021-08-07
Citations: 2
Identifier
DOI:10.1093/biostatistics/kxab025
Abstract
Clustering with variable selection is a challenging yet critical task for modern small-n-large-p data. Existing methods based on sparse Gaussian mixture models or sparse $K$-means provide solutions for continuous data. Given the prevalence of RNA-seq technology and the lack of count data modeling for clustering, the current practice is to normalize count expression data into continuous measures and apply existing models with a Gaussian assumption. In this article, we develop a negative binomial mixture model with lasso or fused lasso gene regularization to cluster samples (small $n$) with high-dimensional gene features (large $p$). A modified EM algorithm and the Bayesian information criterion are used for inference and for determining tuning parameters. The method is compared with existing methods using extensive simulations and two real transcriptomic applications in rat brain and breast cancer studies. The results show the superior performance of the proposed count data model in clustering accuracy, feature selection, and biological interpretation in pathways.
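To make the modeling idea in the abstract concrete, the following is a minimal Python sketch of EM for a plain negative binomial mixture on a samples-by-genes count matrix, scored with the standard BIC. It is not the authors' method: the lasso and fused lasso gene penalties are omitted, the dispersion `r` is assumed known and shared across genes, the initialization by total count is a simple heuristic, and BIC is used here only to compare numbers of clusters. The function names `nb_logpmf` and `fit_nb_mixture` are hypothetical.

```python
import math
import numpy as np

# Vectorized log-gamma built from the standard library (avoids a SciPy dependency).
_lgamma = np.vectorize(math.lgamma)


def nb_logpmf(x, mu, r):
    """Log pmf of a negative binomial with mean mu and size r (variance mu + mu^2 / r)."""
    return (_lgamma(x + r) - _lgamma(r) - _lgamma(x + 1.0)
            + r * np.log(r / (r + mu)) + x * np.log(mu / (r + mu)))


def _log_joint(X, pi, mu, r):
    """Per-sample, per-cluster log(pi_k * p(x_i | cluster k)), assuming independent genes."""
    cols = [nb_logpmf(X, mu[k], r).sum(axis=1) for k in range(len(pi))]
    return np.stack(cols, axis=1) + np.log(pi)


def _logsumexp_rows(a):
    """Numerically stable row-wise log-sum-exp, returned with shape (n, 1)."""
    m = a.max(axis=1, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))


def fit_nb_mixture(X, K, r=10.0, n_iter=100):
    """Unpenalized EM for a K-component NB mixture with fixed, shared dispersion r.

    Assumption for illustration only: no gene-level regularization is applied.
    Returns (pi, mu, resp, bic).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    # Heuristic initialization (an assumption): split samples into K groups by total count.
    order = np.argsort(X.sum(axis=1))
    groups = np.array_split(order, K)
    mu = np.stack([X[g].mean(axis=0) for g in groups]) + 0.5
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: posterior cluster probabilities (responsibilities).
        logp = _log_joint(X, pi, mu, r)
        resp = np.exp(logp - _logsumexp_rows(logp))
        # M-step: with r fixed, the weighted MLE of each NB mean is the weighted sample mean.
        nk = resp.sum(axis=0) + 1e-10
        pi = nk / nk.sum()
        mu = (resp.T @ X) / nk[:, None] + 1e-8  # small offset keeps log(mu) finite

    # Final responsibilities and log-likelihood at the fitted parameters.
    logp = _log_joint(X, pi, mu, r)
    lse = _logsumexp_rows(logp)
    resp = np.exp(logp - lse)
    loglik = float(lse.sum())
    n_params = K * p + (K - 1)  # cluster-by-gene means plus free mixing proportions
    bic = -2.0 * loglik + n_params * math.log(n)
    return pi, mu, resp, bic
```

Calling `fit_nb_mixture(X, K)` for several values of `K` and keeping the fit with the smallest BIC gives a rough way to choose the number of clusters. In a penalized version, the M-step for `mu` would no longer be a closed-form weighted mean, which is one reason a modified EM algorithm is needed.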