Homology modeling
Homology (biology)
Computer science
Sequence homology
Computational biology
Topology (circuits)
Mathematics
Biology
Genetics
Combinatorics
Base sequence
Gene
Biochemistry
Enzyme
Authors
Damiano Sgarbossa,Cyril Malbranke,Anne‐Florence Bitbol
Identifier
DOI:10.1101/2024.05.24.595730
Abstract
Protein design has important implications for drug discovery, personalized medicine, and biotechnology. Models based on multiple sequence alignments efficiently capture the evolutionary information in homologous protein sequences, but multiple sequence alignment construction is imperfect. We present ProtMamba, a homology-aware but alignment-free protein language model based on the Mamba architecture. In contrast with attention-based models, ProtMamba efficiently handles very long context, comprising hundreds of protein sequences. We train ProtMamba on a large dataset of concatenated homologous sequences, using two GPUs. We combine autoregressive modeling and masked language modeling through a fill-in-the-middle training objective. This makes the model adapted to various protein design applications. We demonstrate ProtMamba's usefulness for the generation of novel sequences and for fitness prediction. ProtMamba reaches competitive performance with other protein language models despite its smaller size, which sheds light on the importance of long-context conditioning.
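The abstract describes training on concatenated homologous sequences with a fill-in-the-middle (FIM) objective that blends autoregressive and masked language modeling. The sketch below is a minimal illustration of that general idea, not ProtMamba's actual code: the special tokens (`<cls>`, `<mask>`, `<eos>`) and the helper function are assumptions chosen only to show how a masked span can be moved to the end so a left-to-right model still predicts it.

```python
# Minimal sketch (hypothetical, not ProtMamba's implementation) of building a
# fill-in-the-middle training example from concatenated homologous sequences.
import random

def make_fim_example(homologs, rng=random.Random(0)):
    """Join homologs into one long context, then cut a random span out of the
    last sequence and append it at the end, so an autoregressive model learns
    to 'fill in the middle' conditioned on homologous context."""
    # Long context: all but the last homolog, joined with a separator token.
    context = "<cls>".join(homologs[:-1])

    # Choose a span to mask inside the final (target) sequence.
    target = homologs[-1]
    i = rng.randrange(1, len(target) - 1)
    j = rng.randrange(i + 1, len(target))
    prefix, middle, suffix = target[:i], target[i:j], target[j:]

    # FIM reordering: prefix + <mask> + suffix first, masked span last.
    return f"{context}<cls>{prefix}<mask>{suffix}<eos>{middle}"

# Usage with three toy homologous sequences (one-letter amino acid codes).
print(make_fim_example(["MKTAYIAKQR", "MKTAFIAKQR", "MKSAYIAKQK"]))
```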