Computational biology
Scalability
Biology
Embedding
Sequence (biology)
Computer science
Genetics
Data mining
Database
Artificial intelligence
Authors
Zakieh Tayyebi, Allison R. Pine, Christina Leslie
Identifier
DOI: 10.1038/s41592-024-02274-x
Abstract
Standard scATAC sequencing (scATAC-seq) analysis pipelines represent cells as sparse numeric vectors relative to an atlas of peaks or genomic tiles and consequently ignore genomic sequence information at accessible loci. Here we present CellSpace, an efficient and scalable sequence-informed embedding algorithm for scATAC-seq that learns a mapping of DNA k-mers and cells to the same space, to address this limitation. We show that CellSpace captures meaningful latent structure in scATAC-seq datasets, including cell subpopulations and developmental hierarchies, and can score transcription factor activities in single cells based on proximity to binding motifs embedded in the same space. Importantly, CellSpace implicitly mitigates batch effects arising from multiple samples, donors or assays, even when individual datasets are processed relative to different peak atlases. Thus, CellSpace provides a powerful tool for integrating and interpreting large-scale scATAC-seq compendia.
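The idea of scoring a cell by its proximity to a binding motif in a shared embedding space can be illustrated with a toy sketch. This is not the CellSpace implementation: the abstract does not say how motif embeddings are formed or which similarity measure is used, so the sketch assumes a motif vector is the mean of its k-mer vectors and that proximity is cosine similarity. The function names (`kmers`, `motif_score`), the two-dimensional embeddings, and the cell names are all hypothetical.

```python
import math

def kmers(seq, k):
    """Return all overlapping k-mers of a DNA sequence, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def motif_score(cell_vec, motif_seq, kmer_emb, k):
    """Assumed scoring rule: cosine similarity between a cell vector and
    the mean embedding of the motif's k-mers (k-mers without an embedding
    are skipped)."""
    vecs = [kmer_emb[km] for km in kmers(motif_seq, k) if km in kmer_emb]
    if not vecs:
        return 0.0
    return cosine(cell_vec, mean_vector(vecs))

# Toy, hand-made embeddings (hypothetical values, not learned from data).
kmer_emb = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0]}
cells = {"cell_a": [1.0, 1.0], "cell_b": [-1.0, -1.0]}

# Score each cell against the toy motif "ACGT" using 3-mers.
scores = {name: motif_score(vec, "ACGT", kmer_emb, 3)
          for name, vec in cells.items()}
```

In this toy setup, `cell_a` points in the same direction as the motif vector and receives a high score, while `cell_b` points the opposite way and receives a low one. In a real analysis the embeddings would come from a trained model rather than being written by hand.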