A near-complete assembly of an Arabidopsis thaliana genome

Biology, Arabidopsis, Genome, Computational biology, Genetics, Gene, Mutant
Authors
Xueren Hou, Depeng Wang, Zhukuan Cheng, Ying Wang, Yuling Jiao
Source
Journal: Molecular Plant [Elsevier]
Volume (issue): 15 (8): 1247-1250 Cited by: 34
Identifier
DOI: 10.1016/j.molp.2022.05.014
Abstract

The genome sequence of Arabidopsis thaliana, a widely adopted model species, has greatly expedited plant molecular biology research. More than 20 years after the first release of the genome sequence (Arabidopsis Genome Initiative, 2000), there remain unresolved gap regions that are presumably composed of highly repetitive sequences, such as telomeres, centromeres, 5S rDNA clusters, and nucleolar organizer regions (NORs) containing 45S rDNA. Such repeats are difficult to assemble from comparatively short sequencing reads. A scan of the widely used TAIR10/Araport11 assembly (Lamesch et al., 2012) of Arabidopsis thaliana accession Col-0 identified 165 gaps that encompass all five centromeres, and no single chromosome has been finished end to end. Here, we present a high-quality assembly containing three gapless chromosomes and two chromosomes that are missing sequences only in the two NORs and a telomere at the end of NOR4. By combining long-read Oxford Nanopore Technologies (ONT), high-fidelity long-read PacBio, and short-read Illumina technologies, we obtained a new 133,917,231-bp assembly of accession Col-0, named Col-PEK, which is 14,770,883 bp larger than the TAIR10/Araport11 assembly. Moreover, we filled most remaining gaps found in two recently released high-quality assemblies, Col-CEN and Col-XJTU (Naish et al., 2021; Wang et al., 2021). In this near-complete genome assembly, a total of 27,629 protein-coding genes were annotated, of which 213 are novel. Many of these new genes are located within NORs or centromeres. Notably, we found at least 145 new genes resulting from previously unrecognized hidden duplications, including tandem repeats, which significantly expand our understanding of recent gene duplication. Within the five complete centromeres, we observed that the number of 178-bp tandem satellite DNA repeats (CEN180) was substantially higher than previously assumed.

We integrated ONT, PacBio HiFi, and Illumina NovaSeq reads for preliminary assembly, polishing, and decontamination. Subsequently, we anchored contigs at the chromosome level within the framework of TAIR10 and then filled the two gaps on Chr4 using HiFi contigs/scaffolds. Filling all the gaps and anchoring to TAIR10 yielded identical results. Finally, we corrected structural errors and small misassemblies in regions based only on ONT reads and then checked for potential deletions in NORs using HiFi contigs and read alignment. The final Col-PEK assembly is 133,917,231 bp in size, with all centromeres complete (supplemental methods; Supplemental Figure 1).

We confirmed that the assembly is of high quality by Benchmarking Universal Single-Copy Orthologs (Supplemental Table 1), Core Eukaryotic Gene Mapping Approach evaluations (Supplemental Table 2), GC-Depth analysis (Supplemental Figure 2), Merqury (Rhie et al., 2020) and Inspector (Chen et al., 2021) evaluations (supplemental methods; Supplemental Tables 3 and 4), SNP analysis (Supplemental Table 5), and read alignment using filtered Illumina reads, HiFi reads, and ONT reads (supplemental methods; Supplemental Figures 3 and 4). Notably, Merqury evaluation indicates that Col-PEK is of substantially higher quality than TAIR10 and Col-CEN and of comparable or slightly higher quality than Col-XJTU (Supplemental Table 3). All sequenced reads from the centromeric region were confirmed using CEN180-specific 11-mer sequences, and Merqury evaluation of the five centromeres shows extremely high accuracy, with an error rate for Chr2 (CEN2) as low as 0 (supplemental methods; Supplemental Table 4). We compared Col-PEK with the TAIR10, Col-XJTU, and Col-CEN assemblies (Naish et al., 2021; Wang et al., 2021) and found perfect collinearity (Figures 1A–1C; Supplemental Figures 5A and 6).
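The Merqury evaluation mentioned above scores an assembly by counting k-mers that occur in the assembly but not in the reads. As a rough illustration of how an error rate of exactly 0 can arise for a region such as CEN2, the sketch below follows the error-rate and consensus-quality (QV) definitions given in the Merqury paper (Rhie et al., 2020). The function names, the default k = 21, and the k-mer counts in the usage lines are assumptions made for illustration; they are not values from this study.

```python
import math


def kmer_error_rate(asm_only_kmers: int, total_asm_kmers: int, k: int = 21) -> float:
    """Per-base error rate estimated from k-mers unique to the assembly.

    asm_only_kmers: k-mers present in the assembly but absent from the reads.
    total_asm_kmers: all k-mers counted in the assembly.
    """
    if total_asm_kmers <= 0:
        raise ValueError("total_asm_kmers must be positive")
    if not 0 <= asm_only_kmers <= total_asm_kmers:
        raise ValueError("asm_only_kmers must lie between 0 and total_asm_kmers")
    # A k-mer is supported only if all k of its bases are correct, so the
    # per-base probability of being correct is the k-th root of the supported fraction.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    return 1 - p_base_correct


def consensus_qv(error_rate: float) -> float:
    """Phred-scaled consensus quality; infinite when no errors are detected."""
    if error_rate == 0:
        return math.inf
    return -10 * math.log10(error_rate)


# Hypothetical counts: a region with no assembly-only k-mers, and one with a few.
clean_region = kmer_error_rate(asm_only_kmers=0, total_asm_kmers=2_000_000)
noisy_region = kmer_error_rate(asm_only_kmers=40, total_asm_kmers=2_000_000)
print(clean_region, consensus_qv(clean_region))
print(noisy_region, consensus_qv(noisy_region))
```

Under these definitions, an error rate of 0 simply means that every k-mer in the region is also found in the read set, so the QV is unbounded rather than a large finite number.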
The new assembly added ∼14.8 Mb of novel sequences, which are mostly located near or within the centromeres (Figure 1A; Supplemental Figures 5A and 7–11). In addition to centromeres, we added ∼499 and ∼183 kb of sequence to the end of the top arms of Chr2 and Chr4, respectively (Supplemental Figures 1 and 7–11). Sequence alignment indicates that these new sequences contain 45S rDNA subunits (i.e., 5.8S, 18S, and 25S rDNA) (Supplemental Table 6), suggesting that they are part of NORs (Sims et al., 2021). Although substantially longer (>98.55%) than in TAIR10, both NORs contain gaps that remain to be finished (supplemental methods). We further applied coverage analysis to estimate the copy number of repeats using Illumina reads (Long et al., 2013). The estimated copy number (>310) of 45S rDNA was much larger than the assembled unit number (∼66) (Supplemental Table 6), providing an estimate of NOR size. We also identified 2.6 kb of telomeric repeats adjacent to the Chr2 NOR, while the Chr4 NOR still lacks telomeric repeats. In total, nine telomeres were identified, ranging from 2.6 to 3.6 kb in size (Supplemental Table 7).
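The coverage-based copy-number estimate described above can be pictured as a depth ratio: reads from every genomic copy of a repeat pile up on the assembled unit, so its depth relative to single-copy sequence approximates the copy number. The sketch below is a minimal version of that idea, not the procedure used in the study (which is described in the supplemental methods). The function name and all depth values are made up for illustration; only the assembled unit count of about 66 comes from the text.

```python
from statistics import median


def estimate_copy_number(repeat_depths, single_copy_depths):
    """Ratio of typical read depth on a repeat unit to typical single-copy depth."""
    baseline = median(single_copy_depths)
    if baseline <= 0:
        raise ValueError("single-copy baseline depth must be positive")
    return median(repeat_depths) / baseline


# Made-up per-base depths, chosen only to illustrate the calculation.
rdna_unit_depths = [9200, 9350, 9300, 9280]   # depth over one 45S rDNA unit
single_copy_depths = [29, 31, 30, 30, 30]     # depth over single-copy regions

estimated_units = estimate_copy_number(rdna_unit_depths, single_copy_depths)
assembled_units = 66  # approximate assembled 45S unit count reported in the text

print(f"estimated copies: {estimated_units:.0f}")
print(f"units not yet assembled: {estimated_units - assembled_units:.0f}")
```

Medians are used here instead of means so that a few positions with extreme depth do not dominate the ratio; that choice is a design assumption of this sketch.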
Comparison with the recently released high-quality assemblies, including Col-CEN and Col-XJTU, indicates that Col-PEK is highly intact, is longer, and has filled multiple remaining gaps longer than 40 kb (Figures 1A–1C; Supplemental Figures 3 and 6; Supplemental Tables 8 and 9; supplemental methods). For example, a 108.7-kb gap on Chr2 that was left by Col-XJTU has now been closed (Figure 1B; Supplemental Figures 3A and 12A). In Col-CEN, a 232.8-kb unknown gap has now been identified and filled within the mtDNA insertion region in Chr2. The resulting mtDNA insert size (640.5 kb) is consistent with the previous fiber-fluorescence in situ hybridization estimate (618 ± 42 kb) (Stupar et al., 2001) and with that of Col-XJTU (Wang et al., 2021) (Figures 1C and 1D; Supplemental Figures 3B, 12B, and 13). We also identified a 36.0-kb unknown gap in Chr1 of Col-CEN, which includes seven coding genes (Figure 1E; Supplemental Figures 3C, 6B, and 12B). These new sequences are well supported by ONT and HiFi reads (Supplemental Figure 3). These analyses explain why the sequences (excluding NORs) of Chr2, Chr4, and Chr5 in Col-PEK are longer than those in Col-XJTU and Col-CEN (Supplemental Table 8). On the other hand, the sequences of Chr1 and Chr3 in Col-PEK are slightly shorter than those in Col-XJTU, which may be due to missing sequences in Col-PEK.
To investigate this, we evaluated the discrepant 21-kb region in Chr1 (Supplemental Figures 4A–4E) and an 11-kb region in Chr3 (Supplemental Figure 4F) and identified ONT pass reads and HiFi reads that continuously span the Col-PEK junctions but not the Col-XJTU junctions. Of note, Col-PEK has sequences identical to those of Col-CEN in these regions (Supplemental Figure 4).

The Col-PEK assembly provides an unprecedented opportunity to estimate the distribution of repetitive sequences. We identified 26,079 simple sequence repeats, with a total length of 400,090 bp, and 46,108 tandem repeats, with a total length of 15,470,062 bp. Subsequently, we used RepeatMasker (http://www.repeatmasker.org/) to predict transposable elements and found 19,274,191 bp (14.40% of the genome sequence) attributable to transposable elements. Among them, LTR/Gypsy retroelements comprise the largest class, occupying 6,885,521 bp (5.14% of the genome sequence). The total repeat content (26.58%) is much higher than that of TAIR10 (18.51%) (Figure 1D; Supplemental Figure 14; Supplemental Tables 8 and 10).

A total of 27,416 out of 27,445 protein-coding genes were lifted over from Araport11 to Col-PEK. The remaining genes were either located in misassembled TAIR10 regions or were too short (3–39 bp) (Supplemental Figures 15 and 16; Supplemental Table 11; supplemental methods). For example, AT3G41762 was found in a 26-kb misassembled region in TAIR10 but was also found to have four homologous copies that were reassembled into NOR2 and NOR4 in Col-PEK (Supplemental Figure 15; Supplemental Tables 11 and 12). A previous study also suggested that this region in TAIR10 may be problematic (Pucker et al., 2021).
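The repeat fractions quoted above follow from dividing each annotated span by the assembly size. The short sketch below redoes that arithmetic with the figures reported in the text; the function and variable names are illustrative. It also derives the TAIR10/Araport11 size implied by the reported 14,770,883-bp difference. Whether the published percentages used exactly this denominator is an assumption, and the recomputed values agree with them only to within about 0.01 percentage points.

```python
GENOME_BP = 133_917_231  # Col-PEK assembly size reported in the text


def percent_of_genome(span_bp: int, genome_bp: int = GENOME_BP) -> float:
    """Share of the assembly covered by a feature class, in percent."""
    return 100.0 * span_bp / genome_bp


# Total annotated lengths (bp) as reported in the text.
reported_spans = {
    "transposable elements": 19_274_191,
    "LTR/Gypsy retroelements": 6_885_521,
    "tandem repeats": 15_470_062,
    "simple sequence repeats": 400_090,
}

for name, span in reported_spans.items():
    print(f"{name}: {percent_of_genome(span):.2f}% of the assembly")

# Size of the TAIR10/Araport11 assembly implied by the reported difference.
implied_tair10_bp = GENOME_BP - 14_770_883
print(f"implied TAIR10/Araport11 size: {implied_tair10_bp:,} bp")
```

The individual classes are not summed here because the text does not say whether they overlap, so their total need not equal the reported 26.58% repeat content.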
Notably, we identified 145 previously unknown genes that are highly similar (>99% DNA sequence similarity) to existing genes (Figure 1D, inner circle; Supplemental Table 12; supplemental methods). Among these hidden duplicated genes, 70 are located in the two NORs and 47 are located in the above-mentioned mtDNA insert (Supplemental Figures 7–11). Some of the latter group putatively encode mitochondrial respiration pathway proteins according to the functional descriptions of the homologous genes provided by TAIR. At least 56 of the newly identified hidden duplicated genes form tandem repeats, in which two or more homologous genes are adjacent to each other along the chromosome (Figure 1D, inner circle; Supplemental Table 12). Hidden gene duplications have previously been found in limited cases, such as SEC10 (Vukašinović et al., 2014) (Supplemental Figure 16A), and our findings suggest that they are more common (supplemental methods; Supplemental Table 12). Different duplicated genes may be adjacent to each other. For example, a newly updated region in Chr5 contains two types of gene duplications, one with a single gene duplicated twice and the other with a block of three genes duplicated once (Supplemental Table 12), supporting a recent report (Pucker et al., 2021).
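The definition of a tandem repeat used above (two or more homologous genes adjacent to each other along a chromosome) lends itself to a simple scan over position-sorted genes. The sketch below is one way to express that definition, assuming each gene has already been assigned to a homolog family; it is not the pipeline used in the study. The tuple layout, the family labels, and the example coordinates are all hypothetical.

```python
from itertools import groupby


def tandem_arrays(genes):
    """Return runs of two or more adjacent genes from the same homolog family.

    genes: iterable of (chromosome, start, family_id) tuples.
    """
    ordered = sorted(genes, key=lambda g: (g[0], g[1]))
    arrays = []
    # groupby only merges consecutive items, so a run breaks as soon as a
    # gene from a different family (or chromosome) lies in between.
    for _, run in groupby(ordered, key=lambda g: (g[0], g[2])):
        members = list(run)
        if len(members) >= 2:
            arrays.append(members)
    return arrays


# Hypothetical genes: two adjacent famA copies on Chr5 form one tandem array;
# the third famA copy is separated by a famB gene and is not counted.
example_genes = [
    ("Chr5", 100, "famA"),
    ("Chr5", 300, "famA"),
    ("Chr5", 500, "famB"),
    ("Chr5", 700, "famA"),
    ("Chr2", 50, "famA"),
]
for array in tandem_arrays(example_genes):
    print(array)
```

A stricter variant could also cap the distance between neighbours or allow a small number of intervening genes; the text does not specify either, so this sketch requires strict adjacency.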
To further identify genes in the new sequences, we applied three independent approaches (ab initio prediction, homology search, and reference-guided transcriptome assembly) and obtained another 68 new coding genes, including 17 genes supported by transcriptomics data (Figure 1F; Supplemental Figures 7–11; Supplemental Table 13). Most of these new genes are located in the NORs and the mtDNA insert, and some are scattered over centromeres bound by centromere-specific histone H3-like protein (CENH3) (Supplemental Figures 7–11).

Owing to its higher completeness, Col-PEK surpasses Col-CEN and Col-XJTU with respect to the number of identified non-coding RNA (ncRNA) genes. In total, we identified 5,959 ncRNA genes, including 3,910 coding for 5S rRNA, 71 for 18S rRNA, 64 for 25S rRNA, 66 for 5.8S rRNA, 648 for tRNA, and 1,200 for other ncRNAs including riboswitches and ribozymes (Supplemental Table 6). Notably, our analysis has substantially expanded the number of 5S rRNAs (Supplemental Tables 6 and 8) and reveals that many are concentrated in the vicinity of the centromeres of Chr3 to Chr5, interspersed with LTR/Gypsy elements (Supplemental Figures 7–11). PacBio HiFi data are helpful for filling gaps on Chr4 and for retrieving repetitive sequences that are prone to loss, such as 5S rDNA and CEN180 arrays, which gives Col-PEK an advantage in annotating these repetitive elements (Supplemental Figure 17; supplemental methods).

The five intact centromeres provide a unique opportunity to finely dissect centromere organization. We identified a total of 66,232 centromeric CEN180 repeats, which is more than in the Col-CEN and Col-XJTU assemblies (Figure 1D; Supplemental Figures 7–11; Supplemental Table 8). The CEN180 array bodies in each centromere range from 2.36 to 4.40 Mb. CENH3 binds to expanded regions centered on CEN180 repeat clusters, defining functional centromeres.
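The gene counts reported in this text can be cross-checked with simple arithmetic. The sketch below tallies the protein-coding genes (lifted over, hidden duplicates, and other new genes) and the ncRNA classes, using only numbers stated in the text. Treating the three protein-coding categories as disjoint groups that add up to the annotated total is an assumption of this sketch; the text does not state it explicitly, although the figures are consistent with it.

```python
# Protein-coding gene counts as reported in the text.
lifted_over = 27_416        # Araport11 genes lifted over to Col-PEK
hidden_duplicates = 145     # new genes >99% similar to existing genes
other_new_genes = 68        # new genes from prediction, homology, transcriptome

novel_genes = hidden_duplicates + other_new_genes
annotated_total = lifted_over + novel_genes
not_lifted_over = 27_445 - lifted_over  # Araport11 genes that failed liftover

# ncRNA gene counts per class as reported in the text.
ncrna_counts = {
    "5S rRNA": 3_910,
    "18S rRNA": 71,
    "25S rRNA": 64,
    "5.8S rRNA": 66,
    "tRNA": 648,
    "other ncRNA": 1_200,
}
ncrna_total = sum(ncrna_counts.values())

print(f"novel protein-coding genes: {novel_genes}")
print(f"annotated protein-coding genes: {annotated_total}")
print(f"Araport11 genes not lifted over: {not_lifted_over}")
print(f"ncRNA genes: {ncrna_total}")
```

The tallies reproduce the 213 novel genes, the 27,629 annotated protein-coding genes, and the 5,959 ncRNA genes given in the text.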
We found that the length of the CENH3-bound regions is roughly consistent with previous estimates of centromere size from physical maps (Hosouchi et al., 2002; Kumekawa et al., 2000, 2001) and with that of Col-XJTU but is ∼1.82 Mb longer than the CENH3-bound regions of Col-CEN (Supplemental Table 14). In all chromosomes, CENH3 is enriched in the centromere core region and is less abundant in regions enriched with LTR/Gypsy. In addition, CENH3 is preferentially associated with certain subsets of CEN180.

ONT sequencing allows detection of DNA methylation, and the results are highly correlated with bisulfite-sequencing results. We found that the NORs and 5S rDNA arrays are hypermethylated, and the centromeric regions show higher CpG methylation than the arms, although the CEN180 arrays are relatively hypomethylated. In addition, telomeric regions are hypomethylated (Figure 1D; Supplemental Figures 7–11 and 18).

In conclusion, the newly obtained near-complete Col-PEK assembly of Arabidopsis thaliana accession Col-0, in conjunction with other recently reported high-quality assemblies (Naish et al., 2021; Wang et al., 2021), provides a long-awaited key resource for the plant community. An online information portal that includes an interactive searchable browser and downloadable genome assembly and annotation files for Col-PEK is available at http://col-pek.arashare.cn/.

This work was supported by the National Key R&D Program of China grant 2019YFA0903900 (to Y.J. and Y.W.).