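Sequence assembly
Hash function
Algorithm
Genome
Ploidy
Computer science
Hybrid genome assembly
DNA sequencing
Graph
Software
Computational biology
Function (biology)
Biology
Genomics
Theoretical computer science
DNA
Genetics
Gene
Gene expression
Transcriptome
Programming language
Computer security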
Authors
Laura Natalia González-García, David Guevara-Barrientos, Daniela Lozano‐Arce, Juanita Gil, Jorge Díaz-Riaño, Erick Duarte, Germán I. Andrade, Juan Camilo Bojacá, Maria Camila Hoyos-Sanchez, Christian Chavarro, Natalia Guayazán Palacios, Luis Alberto Chica Cárdenas, Maria Camila Buitrago Acosta, Edwin Bautista, Miller Trujillo, Jorge Duitama
Identifier
DOI: 10.26508/lsa.202201719
Abstract
Building de novo genome assemblies for complex genomes is possible thanks to long-read DNA sequencing technologies. However, maximizing the quality of assemblies based on long reads is a challenging task that requires the development of specialized data analysis techniques. We present new algorithms for assembling long DNA sequencing reads from haploid and diploid organisms. The assembly algorithm builds an undirected graph with two vertices for each read based on minimizers selected by a hash function derived from the k-mer distribution. Statistics collected during the graph construction are used as features to build layout paths by selecting edges, ranked by a likelihood function. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. We ran the implemented algorithms on PacBio HiFi and Nanopore sequencing data taken from haploid and diploid samples of different species. Our algorithms showed competitive accuracy and computational efficiency, compared with other currently used software. We expect that this new development will be useful for researchers building genome assemblies for different species.
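The minimizer idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the hash function, so the count-based ranking used here (rarer k-mers rank first, ties broken lexicographically) is an assumption, and the function names `kmer_counts`, `minimizers`, and `shared_minimizers` are hypothetical. The sketch only shows how reads that share minimizers could be flagged as candidate overlaps before any graph is built.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(reads, k):
    """Count every k-mer across all reads (the 'k-mer distribution')."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def minimizers(seq, k, w, rank):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    For each window of w consecutive k-mers, the k-mer with the smallest
    rank is selected; ties are broken by the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (rank(kmers[i]), i))
        selected.add((best, kmers[best]))
    return sorted(selected)


def shared_minimizers(reads, k, w):
    """Map each read pair (i, j), i < j, to the number of minimizer k-mers they share.

    Assumption: k-mers are ranked by (count, k-mer), a stand-in for a
    hash function derived from the k-mer distribution.
    """
    counts = kmer_counts(reads, k)

    def rank(kmer):
        return (counts[kmer], kmer)

    per_read = [{km for _, km in minimizers(r, k, w, rank)} for r in reads]
    shared = {}
    for i, j in combinations(range(len(reads)), 2):
        common = per_read[i] & per_read[j]
        if common:
            shared[(i, j)] = len(common)
    return shared
```

Pairs reported by `shared_minimizers` would be candidates for edges in an overlap graph. Building the two-vertex-per-read graph and ranking edges with a likelihood function, as the abstract describes, is beyond this sketch.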