Authors
Liu Liu, Shao Hui Huang, Feng Jiang, Guo-Qing Liang, Xiaobin Zhu, Hong Zhu, Weidong Tian
Abstract
How can integrating updated single-cell transcriptomics and protein-protein interactions (PPIs) with machine learning algorithms improve gene prioritization for spermatogenic failure and predict intracytoplasmic sperm injection (ICSI) outcomes?

A machine learning framework integrating single-cell RNA sequencing (scRNA-seq) and PPI networks efficiently identified 320 candidate genes for spermatogenic failure and achieved high precision in predicting ICSI outcomes (precision-recall curve (PRC)-AUC = 0.96, 95% CI: 0.89-1.00; receiver operating characteristic (ROC)-AUC = 0.82, 95% CI: 0.63-0.97).

Over 100 genes are implicated in spermatogenic failure, yet patients with distinct genetic backgrounds exhibit highly variable ICSI outcomes. While machine learning-based gene prioritization offers potential for novel gene discovery, existing methods rely on bulk RNA sequencing or lack multi-omics integration, limiting their ability to leverage single-cell resolution or predict clinical outcomes.

This study combined scRNA-seq data (capturing cell type- and developmental stage-specific expression) from healthy human tissues with PPI networks to train predictive models. Validation included 5-fold cross-validation, functional enrichment analyses, and clinical data from whole-exome sequencing (WES) and ICSI outcomes in 34 patients with spermatogenic failure subtypes (azoospermia, asthenozoospermia, teratozoospermia).

Public datasets (Human Protein Atlas, STRING, Gene Expression Omnibus) provided scRNA-seq and PPI data. Node2Vec-derived PPI network embeddings and cell type- and developmental stage-specific expression features were used to train random forest classifiers. Gene Ontology and Mammalian Phenotype Ontology enrichment analyses, together with WES of patient blood samples, validated candidate genes and ICSI outcomes.

Our models demonstrated robust performance in spermatogenic failure gene prediction (PRC-AUC = 0.88, 95% CI: 0.83-0.93; ROC-AUC = 0.98, 95% CI: 0.96-0.99), subtype classification (e.g. teratozoospermia, PRC-AUC = 0.96, 95% CI: 0.91-0.99; ROC-AUC = 0.94, 95% CI: 0.87-0.98), and ICSI outcome prediction (PRC-AUC = 0.96, 95% CI: 0.89-1.00; ROC-AUC = 0.82, 95% CI: 0.63-0.97). WES of patient samples revealed an increased detection rate of likely causative variants among a subset of model-predicted genes, rising from 11.8% to 29.4%, with clinical outcomes aligning with model predictions.

Model limitations include training on literature-curated or database-annotated gene labels, which may introduce misclassification or annotation bias. Additionally, the absence of experimental validation and the limited size and diversity of external cohorts necessitate further verification.

This integrative machine learning framework provides a powerful tool for uncovering genetic contributors to male infertility and predicting treatment outcomes, paving the way for improved diagnostic strategies and more informed clinical decision-making in reproductive medicine.

This work was supported by the National Natural Science Foundation of China (32370719, 32170667), the Shanghai Municipal Science and Technology Major Project (2017SHZDZX01), and the National Key Research and Development Program of China (2021YFC2301503). The authors declare no competing interests.
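The abstract reports every result as a PRC-AUC and a ROC-AUC with a 95% CI, but does not say how the intervals were obtained. The sketch below shows one common way to compute such numbers, assuming a percentile bootstrap over samples; the function names, the toy labels, and the toy scores are all hypothetical and are not taken from the study.

```python
import random

def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Simplification: tied scores are ranked in input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recovered positive
    return total / n_pos

def bootstrap_ci(metric, labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (assumed method, not confirmed by the abstract)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical toy data: 1 = positive class, scores = classifier probabilities.
labels = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
auc_ci = bootstrap_ci(roc_auc, labels, scores)
ap_ci = bootstrap_ci(average_precision, labels, scores)
print(f"ROC-AUC = {auc:.2f}, 95% CI: {auc_ci[0]:.2f}-{auc_ci[1]:.2f}")
print(f"PRC-AUC = {ap:.2f}, 95% CI: {ap_ci[0]:.2f}-{ap_ci[1]:.2f}")
```

In this toy example positives outnumber negatives, so the average precision (about 0.87) is higher than the ROC-AUC (about 0.67). The chance-level baseline for PRC-AUC is the positive-class prevalence, whereas for ROC-AUC it is 0.5, which is worth keeping in mind when comparing the two metrics on a small cohort.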
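The detection rates of 11.8% and 29.4% can be checked against the cohort size. The abstract gives percentages only, so the sketch below assumes that both rates use the 34-patient cohort as the denominator; under that assumption the patient counts 4 and 10 reproduce the reported figures, but these counts are inferred here and are not stated in the source.

```python
COHORT_SIZE = 34  # patients with WES and ICSI data, as stated in the abstract

def detection_rate(n_with_variant, n_patients=COHORT_SIZE):
    """Percentage of patients with a likely causative variant, to one decimal."""
    return round(100 * n_with_variant / n_patients, 1)

# Hypothetical counts inferred from the reported percentages.
rate_before = detection_rate(4)   # before adding model-predicted genes
rate_after = detection_rate(10)   # after adding model-predicted genes
print(rate_before, rate_after)
```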