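Computer science
Correctness
Leverage (statistics)
Artificial intelligence
Sequence alignment
Embedding
Computational biology
Key (lock)
Structural alignment
Multiple sequence alignment
Genomics
Machine learning
Genome
Peptide sequence
Biology
Algorithm
Genetics
Gene
Computer security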
Authors
Felipe Llinares-López, Quentin Berthet, Mathieu Blondel, Olivier Teboul, Jean-Philippe Vert
Identifier
DOI:10.1101/2021.11.15.468653
Abstract
Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here, we leverage recent advances in deep learning for language modelling and differentiable programming to propose DEDAL, a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. We show that, once trained, DEDAL improves alignment correctness on remote homologs by up to two- or three-fold over existing methods, and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.