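Transcription factor
DNA binding site
Computer science
Factor (programming language)
DNA
Computational biology
Transfer learning
Natural language processing
Artificial intelligence
Genetics
Biology
Promoter
Gene
Programming language
Gene expression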
Authors
Ekin Deniz Aksu, Martin Vingron
Identifier
DOI: 10.1101/2024.11.08.622635
Abstract
Identification of in vivo transcription factor (TF) binding sites is crucial for understanding gene regulatory networks, but the experimental methods for identifying them scale poorly, which directs researchers towards computational models. TF binding site prediction models are often specific to a given TF, which hinders their generalizability to previously unseen TFs. Here, we present an approach to predict in vivo TF binding sites using DNA accessibility, TF RNA expression and TF binding motifs. Our novel method leverages DNA language model embeddings and transfer learning to improve its accuracy and generalizability, achieving a mean area under the precision-recall curve (AUPR) of 0.51 in held-out cell types and chromosomes in the ENCODE-DREAM in vivo TFBS prediction challenge, outperforming the top-ranked methods. Furthermore, we show that prediction accuracy increases when TFs are highly active and exhibit cell-type-specific expression. Finally, we test our models on an independent dataset of previously unseen TFs and report a mean AUPR of 0.36, which is state-of-the-art in a cross-TF, cross-cell-type and cross-chromosomal setting.
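The headline numbers in the abstract are mean AUPR values. The abstract does not say how the curve area was computed, so the following is only a minimal sketch that assumes the common step-wise estimate (average precision) and an unweighted mean over evaluation tasks. The function names `average_precision` and `mean_aupr` are hypothetical and are not taken from the paper.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 0/1 values, 1 meaning the site is bound.
    scores: predicted binding scores, higher meaning more likely bound.
    Assumption: tied scores are not treated specially.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank candidate sites from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            # Each recovered positive adds 1/n_pos of recall at the current precision.
            area += (true_pos / rank) / n_pos
    return area


def mean_aupr(per_task):
    """Unweighted mean AUPR; per_task maps a task name to (labels, scores)."""
    values = [average_precision(labels, scores) for labels, scores in per_task.values()]
    return sum(values) / len(values)
```

Because bound sites are rare relative to unbound ones, AUPR is far more sensitive to false positives than ROC-based metrics, which is one reason it is a natural choice for this kind of genome-wide prediction task.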