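Computer science
Metadata
Data integration
Principal component analysis
Signal (programming language)
Biological data
Data mining
Scale (ratio)
Artificial intelligence
Pattern recognition (psychology)
Bioinformatics
Biology
Physics
Quantum mechanics
Programming language
Operating system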
Authors
Yang Zhou, Qiongyu Sheng, Shuilin Jin
Identifier
DOI:10.1073/pnas.2416516122
Abstract
Constructing single-cell atlases requires preserving differences attributable to biological variables, such as cell types, tissue origins, and disease states, while eliminating batch effects. However, existing methods are inadequate in explicitly modeling these biological variables. Here, we introduce SIGNAL, a general framework that leverages biological variables to disentangle biological and technical effects, thereby linking these metadata to data integration. SIGNAL employs a variant of principal component analysis to align multiple batches, enabling the integration of 1 million cells in approximately 2 min. SIGNAL, despite its computational simplicity, surpasses state-of-the-art methods across multiple integration scenarios: 1) heterogeneous datasets, 2) cross-species datasets, 3) simulated datasets, 4) integration on low-quality cell annotations, and 5) reference-based integration. Furthermore, we demonstrate that SIGNAL accurately transfers knowledge from reference to query datasets. Notably, we propose a self-adjustment strategy to restore annotated cell labels potentially distorted during integration. Finally, we apply SIGNAL to multiple large-scale atlases, including a human heart cell atlas containing 2.7 million cells, identifying tissue- and developmental stage-specific subtypes, as well as condition-specific cell states. This underscores SIGNAL’s exceptional capability in multiscale analysis.
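The abstract does not spell out how SIGNAL's PCA variant works, so the following is only a toy illustration of the general idea of letting biological labels guide batch alignment before a PCA. It is not the authors' algorithm. The function name `label_aware_pca`, the mean-shift alignment step, and the synthetic data in `demo` are all assumptions introduced for this sketch: within each cell type, every batch is shifted so that its mean matches the average of the batch means, and an ordinary PCA (via SVD) is then applied.

```python
import numpy as np


def label_aware_pca(X, batches, labels, n_components=2):
    """Toy label-guided alignment followed by PCA (hypothetical, not SIGNAL).

    X: (cells x features) matrix; batches, labels: one entry per cell.
    Returns the PCA embedding and the aligned matrix.
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    labels = np.asarray(labels)
    aligned = X.copy()

    for lab in np.unique(labels):
        in_label = labels == lab
        present = np.unique(batches[in_label])
        # Mean profile of this cell type within each batch that contains it.
        batch_means = {
            b: X[in_label & (batches == b)].mean(axis=0) for b in present
        }
        # Target: unweighted average of the per-batch means for this type.
        target = np.mean(list(batch_means.values()), axis=0)
        for b in present:
            mask = in_label & (batches == b)
            # Shift the group so its mean coincides with the target.
            aligned[mask] += target - batch_means[b]

    # Ordinary PCA on the aligned data via singular value decomposition.
    centered = aligned - aligned.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    embedding = centered @ vt[:n_components].T
    return embedding, aligned


def demo(n=50, seed=0):
    """Synthetic data: two cell types, two batches, one batch offset."""
    rng = np.random.default_rng(seed)
    type_means = {"A": np.array([0.0, 0.0, 0.0]), "B": np.array([5.0, 5.0, 0.0])}
    batch_shift = {"b1": np.zeros(3), "b2": np.array([0.0, 0.0, 10.0])}
    rows, batches, labels = [], [], []
    for b, shift in batch_shift.items():
        for t, mu in type_means.items():
            rows.append(mu + shift + rng.normal(scale=0.1, size=(n, 3)))
            batches += [b] * n
            labels += [t] * n
    return np.vstack(rows), np.array(batches), np.array(labels)


if __name__ == "__main__":
    X, batches, labels = demo()
    embedding, aligned = label_aware_pca(X, batches, labels)
    print(embedding.shape)
```

In this synthetic setting the batch offset lies along a single feature, so matching the per-type means removes it while the separation between the two cell types is kept. Real single-cell data would need normalization, feature selection, and handling of cell types missing from some batches, none of which this sketch attempts.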