Computer science
Computational biology
Natural language processing
Artificial intelligence
Biology
Authors
Xingyu Liao,Yanyan Li,Yingfu Wu,Wen Long,M. Q. Jing,Bolin Chen,Xingyi Li,Xuequn Shang
Identifier
DOI:10.1021/acssynbio.5c00631
Abstract
The accurate classification of Cas proteins is crucial for understanding CRISPR-Cas systems and developing genome-editing tools. Here, we present TEMC-Cas, a deep learning framework for accurate classification of Cas proteins that combines a fine-tuned ESM protein language model with contrastive learning. Unlike traditional methods that rely on sequence similarity (e.g., BLAST, HMMs) or structural prediction, TEMC-Cas leverages evolutionary-scale modeling to capture distant homology while employing contrastive learning to distinguish closely related subtypes. The framework incorporates LoRA for efficient parameter adaptation and addresses class imbalance through weighted loss functions. TEMC-Cas achieves superior performance in classifying the Cas1-Cas13 families and 17 Cas12 subtypes, demonstrating particular strength in identifying remote homology. This approach provides a robust tool for the discovery of CRISPR systems and expands the toolbox for genome engineering applications. TEMC-Cas is now freely accessible at https://github.com/Xingyu-Liao/TEMC-Cas.
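
The abstract mentions weighted loss functions for class imbalance without giving the weighting scheme. As a rough illustration only, the sketch below uses inverse-frequency class weights with a weighted cross-entropy, which is one common choice and not necessarily what TEMC-Cas does. The function names, the dictionary-based logits, and the toy labels are all hypothetical and are not taken from the TEMC-Cas repository.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight_c = N / (K * count_c), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-sample cross-entropy.

    logits: list of dicts mapping class name -> raw score (hypothetical layout).
    labels: list of true class names, one per sample.
    weights: dict mapping class name -> class weight.
    """
    total, weight_sum = 0.0, 0.0
    for row, y in zip(logits, labels):
        # Numerically stable log-sum-exp over this sample's class scores.
        m = max(row.values())
        log_z = m + math.log(sum(math.exp(v - m) for v in row.values()))
        w = weights[y]
        total += w * (log_z - row[y])  # -log softmax probability of the true class
        weight_sum += w
    return total / weight_sum


# Toy, made-up labels: three samples of a common family, one of a rare family.
toy_labels = ["Cas9", "Cas9", "Cas9", "Cas12"]
toy_weights = inverse_frequency_weights(toy_labels)
toy_logits = [{"Cas9": 2.0, "Cas12": 0.0}] * 4  # a model that always favors Cas9
toy_loss = weighted_cross_entropy(toy_logits, toy_labels, toy_weights)
```

With these toy inputs the single rare-class sample carries the same total weight as the three common-class samples combined, so a model that always predicts the majority class is penalized more heavily than under an unweighted loss.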