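Computer science
Artificial intelligence
Transcription factor
Factor (programming language)
Natural language processing
Computational biology
Gene
Programming language
Biology
Genetics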
Authors
Liyuan Gao,K.-H. Shu,Jun Zhang,Victor S. Sheng
Identifier
DOI:10.1109/bibm58861.2023.10385498
Abstract
Language models have exhibited remarkable performance across diverse tasks, including those in biological research such as protein language modeling. Transcription factors (TFs) are pivotal in gene regulation, influencing gene expression by binding specific DNA sequences. While various TF prediction techniques exist, they often require extensive training datasets or suffer from limited accuracy. In this study, we propose ESM-TFpredict, a model that uses a pre-trained protein language model to encode amino acid sequences, followed by 1-D convolutional neural networks for TF prediction. To elucidate the model's decision-making, we employ an integrated gradients method to highlight the features that drive TF identification. Comparative experiments against two existing models, DeepTFactor and TFpredict, show that ESM-TFpredict scores above 95% on all four evaluation metrics, surpassing both competitors. By using a sliding-window approach to compress the protein representation, ESM-TFpredict trains in 315.78 seconds, which is 51% of the training time required by DeepTFactor and 12% of that required by TFpredict. We further analyze the contributions of known TF-related regions (average attribution score 0.9152) versus non-TF-related regions (average attribution score 0.0848), demonstrating that the TF-related regions dominate TF prediction.
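The abstract mentions compressing the protein representation with a sliding window but gives no details. The sketch below shows one way such a step might look, not the authors' implementation: it mean-pools per-residue embeddings over fixed windows along the sequence axis. The function name `sliding_window_compress`, the window size, the stride, and the choice of mean pooling are all assumptions made for illustration.

```python
import numpy as np


def sliding_window_compress(embeddings, window=8, stride=8):
    """Mean-pool per-residue embeddings over windows along the sequence axis.

    Hypothetical helper: window, stride, and mean pooling are illustrative
    assumptions, not values taken from the paper.

    embeddings: array of shape (sequence_length, embedding_dim)
    returns:    array of shape (ceil(sequence_length / stride), embedding_dim)
    """
    embeddings = np.asarray(embeddings, dtype=float)
    if embeddings.ndim != 2 or embeddings.shape[0] == 0:
        raise ValueError("expected a non-empty 2-D array (length, dim)")
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")

    length = embeddings.shape[0]
    # The last window may be shorter than `window`; it is averaged over
    # whatever residues remain so that no part of the sequence is dropped.
    pooled = [
        embeddings[start:start + window].mean(axis=0)
        for start in range(0, length, stride)
    ]
    return np.stack(pooled)


# Toy stand-in for a protein embedding: 20 residues, 4 features each.
toy = np.arange(20 * 4).reshape(20, 4)
compressed = sliding_window_compress(toy, window=8, stride=8)
```

With a stride equal to the window size, the sequence axis shrinks by roughly a factor of `window`, which is the kind of reduction that would shorten the input to a downstream 1-D convolutional network and so reduce training time.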