A Vision of How Low-Coverage Sequence Data Should Contribute to Genetic Evaluation in the Future

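Keywords: Imputation (statistics); Haplotype; Sequence (biology); Reference genome; Whole-genome sequencing; Genomics; Computational biology; DNA sequencing; Genome; Data mining; Computer science; Biology; Genetics; Genotype; Gene; Missing data; Machine learning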

Authors

R. M. Thallman, J. E. Borgert, Bailey N. Engle, J. W. Keele, W. M. Snelling, Cedric Gondro, L. A. Kuehn

Source

Journal: Journal of Animal Science [Oxford University Press]
Date: 2025-09-05 Citations: 1

Links

nih.gov, doi.org

Identifiers

DOI: 10.1093/jas/skaf294

Abstract

Low-coverage sequencing refers to sequencing DNA of individuals to a low depth of coverage (e.g., 0.5X) and imputing that sequence to genomic sequence based on reference haplotypes from individuals sequenced to high depth of coverage (e.g., ≥ 10X). It has been proposed as an alternative to genotyping by SNP arrays. At least one commercial product based on it is available for agricultural species. Concerns limiting adoption in its current form are: 1) the cost of storing the huge volume of data it generates and 2) whether that additional data will result in improved accuracy of genetic evaluation. This work envisions future implementation of low-coverage sequencing to reduce storage costs and enhance genetic evaluations by leveraging the additional information in the full sequence of the pangenome to account for more genetic variation. We propose addressing the storage issue by representing genomic sequence of an individual in a pair of haplotype arrays with each element pointing to an enumerated haplotype of the sequence within one of approximately 50,000 defined genome segments. Assuming 60 million genomic variants, the infrastructure required to translate the identifier of any enumerated haplotype into its genomic sequence would require less than 10 gigabytes of binary storage. Each haplotype array element would require 2 bytes, so the marginal binary storage required to represent the genomic sequence of an individual would be about 200 kilobytes (KB), similar to the genotypes from a SNP array with 200,000 markers. This assumes no pedigree and no ambiguity of the imputation, though the latter is unrealistic. Strategies to minimize, and when necessary, to manage and efficiently represent ambiguity are proposed. The genomic sequence of an individual could be stored in about 1 KB (binary) if both parents have unambiguous sequence stored as described above. 
The proposed system for representing the pangenome includes algorithms for read mapping and imputation intended to leverage all known genetic variation in the target population. It is also designed to use sequencing reads generated for imputing genomic sequence of new individuals to identify unrecognized mutations, crossovers, and structural variants, thus continuously improving the genome representation, especially if widespread use of low-coverage sequencing in livestock industries is realized. This could make improved genetic merit and management of livestock feasible without computational burden.
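
The storage figures in the abstract can be checked with a short sketch. The segment count and the 2-byte element size come from the abstract. Everything else here is a hypothetical illustration and not the authors' implementation: the toy `catalogue`, the function names, and in particular the crossover-based encoding of an offspring, which is only one plausible way the roughly 1 KB figure for an individual with two stored parents might be approached.

```python
import struct

N_SEGMENTS = 50_000     # approximate number of genome segments (from the abstract)
BYTES_PER_ELEMENT = 2   # one 16-bit identifier per segment (from the abstract)
PLOIDY = 2              # a pair of haplotype arrays per individual


def individual_storage_bytes(n_segments=N_SEGMENTS,
                             bytes_per_element=BYTES_PER_ELEMENT,
                             ploidy=PLOIDY):
    """Marginal storage for one individual, ignoring ambiguity and pedigree."""
    return n_segments * bytes_per_element * ploidy


def pack_haplotype(hap_ids):
    """Serialize haplotype identifiers as little-endian unsigned 16-bit values."""
    return struct.pack(f"<{len(hap_ids)}H", *hap_ids)


def unpack_haplotype(blob):
    """Inverse of pack_haplotype."""
    return list(struct.unpack(f"<{len(blob) // 2}H", blob))


# Hypothetical toy catalogue: for each segment, the enumerated haplotype sequences.
catalogue = [
    ["ACGT", "ACGA"],          # segment 0
    ["TTAG", "TTAC", "CTAG"],  # segment 1
    ["GGCA", "GGCC"],          # segment 2
]


def decode(hap_ids, catalogue):
    """Translate one haplotype array into sequence by looking up each segment."""
    return "".join(catalogue[seg][hid] for seg, hid in enumerate(hap_ids))


def inherit(parent_haps, start, crossovers):
    """Assumed pedigree encoding: rebuild a gamete from a parent's two arrays.

    start is the parental haplotype (0 or 1) copied first; crossovers lists the
    segment indices at which copying switches to the other parental haplotype.
    Crossovers inside a segment are not modelled in this sketch.
    """
    switch_points = set(crossovers)
    source = start
    gamete = []
    for seg in range(len(parent_haps[0])):
        if seg in switch_points:
            source = 1 - source
        gamete.append(parent_haps[source][seg])
    return gamete


if __name__ == "__main__":
    print(individual_storage_bytes())         # 200000 bytes, i.e. about 200 KB
    blob = pack_haplotype([0, 2, 1])
    print(len(blob))                          # 2 bytes per segment
    print(decode(unpack_haplotype(blob), catalogue))
    print(inherit(([0, 2, 1], [1, 0, 0]), start=0, crossovers=[2]))
```

A 2-byte element can index at most 65,536 enumerated haplotypes per segment. Under the assumed pedigree encoding, an offspring needs only a starting haplotype and a short list of crossover positions per parent, which is why its storage could be far smaller than the 200 KB needed for a full pair of haplotype arrays.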
