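Sequence (biology)
Tandem repeat
Direct repeat
Computer science
Multiple sequence alignment
Sequence alignment
Segmentation
Computational biology
Algorithm
Biology
Genetics
Artificial intelligence
Genome
Peptide sequence
Gene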
Authors
Andreas Heger, Liisa Holm
Source
Journal: Proteins
[Wiley]
Date: 2000-01-01
Volume (issue): pages: 41 (2): 224-237
Citations: 332
Identifier
DOI:10.1002/1097-0134(20001101)41:2<224::aid-prot70>3.0.co;2-z
Abstract
Many large proteins have evolved by internal duplication, and many internal sequence repeats correspond to functional and structural units. We have developed an automatic algorithm, RADAR, for segmenting a query sequence into repeats. The segmentation procedure has three steps: (i) repeat length is determined by the spacing between suboptimal self-alignment traces; (ii) repeat borders are optimized to yield a maximal integer number of repeats; and (iii) distant repeats are validated by iterative profile alignment. The method identifies short, composition-biased repeats as well as gapped approximate repeats and complex repeat architectures involving many different types of repeats in the query sequence. No manual intervention and no prior assumptions on the number and length of repeats are required. Comparison to the Pfam-A database indicates good coverage, accurate alignments, and reasonable repeat borders. Screening the Swissprot database revealed 3,000 repeats not annotated in existing domain databases. A number of these repeats had been described in the literature, but most were novel. This illustrates how, in times when curated databases grapple with ever-increasing backlogs, automatic (re)analysis of sequences provides an efficient way to capture this important information.
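The idea behind step (i), reading the repeat length off the spacing of self-similarity diagonals, can be illustrated with a much simpler stand-in. The sketch below is not the RADAR implementation: instead of suboptimal self-alignment traces scored with a substitution matrix, it assumes ungapped repeats and just counts identical residues along each diagonal of the sequence compared with itself. The function names `diagonal_scores` and `estimate_repeat_length` and the `min_offset` parameter are hypothetical, chosen only for this illustration.

```python
from typing import Dict


def diagonal_scores(seq: str, min_offset: int = 1) -> Dict[int, int]:
    """For each offset d, count positions i where seq[i] == seq[i + d].

    Each offset corresponds to one diagonal of the self-comparison
    (dot-plot) matrix. Offsets run from min_offset to len(seq) // 2.
    """
    n = len(seq)
    return {
        d: sum(1 for i in range(n - d) if seq[i] == seq[i + d])
        for d in range(min_offset, n // 2 + 1)
    }


def estimate_repeat_length(seq: str, min_offset: int = 1) -> int:
    """Return the offset whose diagonal has the highest fraction of matches.

    Simplifying assumption: repeats are ungapped, so one diagonal
    captures them. Ties are broken in favour of the shortest offset,
    because multiples of the true period score equally well.
    """
    scores = diagonal_scores(seq, min_offset)
    if not scores:
        raise ValueError("sequence too short for the requested min_offset")
    n = len(seq)
    # Normalise by the number of compared positions (n - d) so that
    # long offsets are not penalised for having shorter diagonals.
    return max(scores, key=lambda d: (scores[d] / (n - d), -d))


if __name__ == "__main__":
    query = "ACDEFGHIK" * 4  # toy sequence: four exact copies of a 9-residue unit
    print(estimate_repeat_length(query))
```

On the toy sequence, the diagonals at offsets 9 and 18 match at every compared position, and the tie-break selects the shorter one. Real repeats are usually gapped and divergent, which is why the abstract describes alignment traces, border optimization, and profile-based validation rather than a plain identity count.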