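Computer science
Leverage (statistics)
Graph
Experimental setup
Artificial intelligence
Generalization
Convolutional neural network
Data mining
Machine learning
Spatial analysis
Representation (politics)
Pattern recognition (psychology)
Sequence (biology)
Protein structure prediction
Set (abstract data type)
Multilayer perceptron
Attention network
Construct (python library)
Perceptron
Sequence labeling
Solubility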
Authors
Guanjie Song,Zhenjie Luo,Aoyun Geng,Junlin Xu,Yajie Meng,Feifei Cui,Leyi Wei,Quan Zou,Zilong Zhang
Identifier
DOI:10.1021/acs.jcim.5c02262
Abstract
Research on protein solubility holds critical significance in industrial production, biopharmaceuticals, and the food industry. While numerous studies have focused on predicting protein solubility in recent years, existing models often rely solely on one-dimensional sequence information and three-dimensional spatial contact data, failing to fully leverage other 3D structural features and the global physicochemical properties of proteins. Concurrently, advancements in protein language models (e.g., ESM-C) and spatial structure prediction models (e.g., AlphaFold3) have provided more accurate and comprehensive information for extracting sequence and spatial features, creating new opportunities for solubility prediction. We propose a novel model, called FGNNSol. The model first employs AlphaFold3 to predict the three-dimensional structure of proteins, leveraging this structural information to construct edges and generate edge features. It simultaneously utilizes ESM-C embeddings and other residue-level properties as node features while incorporating protein-level global features to collectively build a comprehensive protein graph representation. It then trains a model integrating GPSol (an improved graph attention network) and Graph Convolutional Networks. Finally, the integrated representation from these networks is concatenated with global protein features and input into a multilayer perceptron to output the prediction. Experimental validation was performed using the Escherichia coli eSOL dataset as the training, validation, and test sets for model development and evaluation, while the Saccharomyces cerevisiae dataset was used as an external test set to assess the model’s generalization capability. Results show that FGNNSol achieved R² values of 0.577 and 0.469 on the two test sets, respectively, clearly outperforming existing models.
When a threshold of 0.5 was used to classify proteins as soluble or insoluble, the binary classification metrics of FGNNSol were also mostly superior to those of existing models, fully demonstrating its effectiveness and generalization ability. The model code and data are available via https://github.com/SCrownJ/FGNNSol.
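The two evaluation views mentioned in the abstract, regression quality via R² and binary classification at a 0.5 threshold, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, the sample values are made up for illustration, and treating a solubility of exactly 0.5 as soluble (`>=`) is an assumption, since the abstract does not state which side of the threshold is inclusive.

```python
from typing import List, Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


def binarize(values: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Label each value as soluble (True) or insoluble (False).

    Assumption: values equal to the threshold count as soluble.
    """
    return [v >= threshold for v in values]


def accuracy_at_threshold(
    y_true: Sequence[float], y_pred: Sequence[float], threshold: float = 0.5
) -> float:
    """Fraction of proteins whose thresholded prediction matches the
    thresholded ground truth."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    true_labels = binarize(y_true, threshold)
    pred_labels = binarize(y_pred, threshold)
    matches = sum(t == p for t, p in zip(true_labels, pred_labels))
    return matches / len(true_labels)


if __name__ == "__main__":
    # Hypothetical solubility values in [0, 1], for illustration only.
    measured = [0.10, 0.40, 0.60, 0.90]
    predicted = [0.20, 0.55, 0.50, 0.80]
    print("R^2:", r_squared(measured, predicted))
    print("accuracy @0.5:", accuracy_at_threshold(measured, predicted))
```

Reporting both views is useful because a model can rank proteins well (high R²) while still misclassifying those close to the threshold, and vice versa.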