Genome
Biology
Genetics
Human genome
Computational biology
Gene
Cancer genome sequencing
Mutation
Germline mutation
DNA sequencing
Genomics
Coding region
Authors
Maxime Tarabichi, Jonas Demeulemeester, Annelien Verfaillie, Adrienne M. Flanagan, Peter Van Loo, Tomasz Konopka
Identifier
DOI: 10.1038/s41587-021-00971-y
Abstract
A substantial fraction of the human genome displays high sequence similarity with at least one other genomic sequence, posing a challenge for the identification of somatic mutations from short-read sequencing data. Here we annotate genomic variants in 2,658 cancers from the Pan-Cancer Analysis of Whole Genomes (PCAWG) cohort with links to similar sites across the human genome. We train a machine learning model to use signals distributed over multiple genomic sites to call somatic events in non-unique regions and validate the data against linked-read sequencing in an independent dataset. Using this approach, we uncover previously hidden mutations in ~1,700 coding sequences and in thousands of regulatory elements, including in known cancer genes, immunoglobulins and highly mutated gene families. Mutations in non-unique regions are consistent with mutations in unique regions in terms of mutation burden and substitution profiles. The analysis provides a systematic summary of the mutation events in non-unique regions at a genome-wide scale across multiple human cancers.