Improved maize reference genome with single-molecule technologies

An improved reference genome for maize, using single-molecule sequencing and high-resolution optical mapping, enables characterization of structural variation and repetitive regions, and identifies lineage expansions of transposable elements that are unique to maize. Supplementary information The online version of this article (doi:10.1038/nature22971) contains supplementary material, which is available to authorized users.

research, which will enable increases in yield to feed the growing world population. The current assembly of the maize genome, based on Sanger sequencing, was first published in 2009 (ref. 3). Although this initial reference enabled rapid progress in maize genomics 1 , the origi nal assembly is composed of more than 100,000 small contigs, many of which are arbitrarily ordered and oriented, markedly complicating detailed analysis of individual loci 6 and impeding investigation of inter genic regions crucial to our understanding of phenotypic variation 7,8 and genome evolution 9,10 .
Here we report a vastly improved de novo assembly and annotation of the maize reference genome (Fig. 1). On the basis of 65× single molecule realtime sequencing (SMRT) (Extended Data Fig. 1), we assembled the genome of the maize inbred line B73 into 2,958 contigs, in which half of the total assembly is made up of contigs larger than 1.2 Mb (Table 1, Extended Data Fig. 2a). The assembly of the long reads was then integrated with a highquality optical map (Extended Data Fig. 1, Extended Data Table 1) to create a hybrid assembly consisting of 625 scaffolds (Table 1). To build chromosomelevel superscaffolds, we combined the hybrid assembly with a minimum tiling path generated from the bacterial artificial chromosomes (BACs) 11 and a highdensity genetic map 12 (Extended Data Fig. 2b). After gapfilling and error correction using short sequence reads, the total size of maize B73 RefGen_v4 pseudomolecules was 2,106 Mb. The new reference assembly has 2,522 gaps, of which almost half (n = 1,115) have optical

OPEN
Letter reSeArCH map coverage, giving an estimated mean gap length of 27 kb (Extended Data Fig. 2c). The new maize B73 reference genome has 240fold higher contiguity than the recently published shortread genome assembly of maize cultivar PH207 (contig N50: 1,180 kb versus 5 kb) 13 .
Comparison of the new assembly to the previous BACbased maize reference genome assembly revealed more than 99.9% sequence identity and a 52fold increase in the mean contig length, with 84% of the BACs spanned by a single contig from the long reads assembly. Alignment of chromatinimmunoprecipitation followed by sequencing (ChIP-seq) data for the centromerespecific histone H3 (CENH3) 14 revealed that centromeres are accurately placed and largely intact. Several previously identified 15 megabasesized misoriented pericentromeric regions were also corrected (Extended Data Fig. 3a, b). Moreover, the ends of the chromosomes are properly identified on 14 out of the 20 chromosome arms based on the presence of tandem telomeric repeats and knob 180 sequences (Extended Data Fig. 3a, c).
Our assembly made substantial improvements in the gene space including resolution of gaps and misassemblies and correction of order and orientation of genes. We also updated the annotation of our new assembly, resulting in consolidation of gene models (Extended Data Fig. 4). Newly published fulllength cDNA data 4 improved the anno tation of alternative splicing by more than doubling the number of alternative transcripts from 1.6 to 3.3 per gene (Extended Data Fig. 5a), with about 70% of genes supported by the fulllength transcripts. Our reference assembly also vastly improved the coverage of regulatory sequences, decreasing the number of genes exhibiting gaps in the 3kb region(s) flanking coding sequence from 20% to < 1% (Extended Data Fig. 5b). The more complete sequence enabled notable improvements in the annotation of core promoter elements, especially the TATA box, CCAATbox, and Y patch motifs (Supplementary Information). Quantitative genetic analyses have shown that polymorphisms in regu latory regions explain a substantial majority of the genetic variation for many phenotypes 7,8 , suggesting that the new reference will markedly improve our ability to identify and predict functional genetic variation.
After its divergence from Sorghum, the maize lineage underwent genome doubling followed by diploidization and gene loss. Previous work showed that gene loss is biased towards one of the parental genomes 3,16 , but our new assembly and annotation instead suggest that 56% of syntenic sorghum orthologues map uniquely to the dominant maize subgenome (designated A, total size 1.16 Gb), whereas only 24% map uniquely to subgenome B (total size 0.63 Gb). Gene loss in maize has primarily been considered in the context of polyploidy and func tional redundancy 16 , but we found that despite its polyploidy, maize has lost a larger proportion (14%) of the 22,048 ancestral gene orthologues than any of the other four grass species evaluated to date (Sorghum, rice, Brachypodium distachyon and Setaria italica; Extended Data Fig. 6). Nearly onethird of these losses are specific to maize, and analysis of a restricted highconfidence set revealed enrichment for genes involved in biotic and abiotic stresses (Extended Data Table 2), for example, NBARC domain diseaseresistance genes 17 and the serpin protease inhibitor involved in pathogen defence and programmed cell death 18 .
Transposable elements were first reported in maize 19 and have since been shown to have important roles in shaping genome evolution and gene regulatory networks of many species 20 . Most of the maize   Letter reSeArCH genome is derived from transposable elements 3,21 , and careful study of a few regions has revealed a characteristic structure of sequentially nested retrotransposons 21,22 and the effect of deletions and recombina tion on retrotransposon evolution 23 . In the annotation of the original maize assembly, however, fewer than 1% of long terminal repeat (LTR) retrotransposon copies were intact 24 . By applying a new homology independent annotation pipeline to our assembly (Extended Data  Table 3) To understand the evolutionary history of maize LTR retrotransposons, we also applied our annotation pipeline to the sorghum reference genome, and used reverse transcriptase protein domain sequences that were accessible owing to the improved assembly of the internal protein coding domains of maize LTR retrotransposons to reconstruct the phylogeny of maize and sorghum LTR retrotransposon families. Despite a higher overall rate of diversification of LTR transposable elements in the maize lineage consistent with its larger genome size, differences in LTR retrotrans poson content between genomes were primarily the result of marked expansion of distinct families in both lineages (Fig. 2).
Maize exhibits tremendous genetic diversity 25 , and both nucleotide polymorphisms and structural variations have important roles in its phenotypic variation 10,26 . However, genomewide patterns of structural variation in plant genomes are difficult to assess 27 , and previous efforts have relied on shortread mapping, which misses the vast majority of intergenic spaces where most rearrangements are likely to occur 10 . To investigate structural variation at a genomewide scale, we generated optical maps (Extended Data Table 1) for two additional maize inbred lines: the tropical line Ki11, one of the founders of the maize nested association mapping (NAM) population 28 , and W22, which has served as a foundation for studies of maize genetics 29 . Owing to the high degree of genomic diversity among these lines, only 32% of the assembled 2,216 Mb map of Ki11 and 39% of the 2,280 Mb W22 map could be mapped to our new B73 reference via common restriction patterns (Table 2, Fig. 3a and Extended Data Fig. 7). The high density of alignments across and near many of the exceedingly retrotransposon rich centromeres reflects the comparatively low genetic diversity of most centromeres in domesticated maize 15 and illustrates the ability of the combined optical mapping/singlemolecule sequencing methodology to traverse large repeatrich regions. Within the aligned regions, approximately 32% of the Ki11 and 26% of the W22 optical maps exhibited clear evidence of structural variation, including 3,408 insertions and 3,298 deletions ( Table 2). The average indel size was approximately 20 kb, with a range from 1 kb to over 1 Mb (Fig. 3b). More than 90% of the indels were unique to one inbred or the other, indicating a high level of structural diversity in maize. As shortread sequence data are available from both Ki11 and W22 (ref. 10), we analysed 1,451 of the largest (> 10 kb) deletions and found that 1,083 were supported by a clear reduction in read depth (Fig. 3c). The confirmed deletions occurred in regions of low gene density (4.4 genes per megabase compared to a genomewide average of 18.7 genes per megabase). Onethird (83 out of 257) of the genes missing in Ki11 or W22 lack putative orthologues in all four grasses (rice, sorghum, Brachypodium and Setaria), consistent with previous data 30 .
Although maize is often considered to be a largegenome crop, most major food crops have even larger genomes with more complex repeat landscapes 2 . Our improved assembly of the B73 genome, generated using singlemolecule technologies, demonstrates that additional assemblies of other maize inbred lines and similar highquality assem blies of other repeatrich and largegenome plants are feasible. Further highquality assemblies will in turn extend our understanding of the genetic diversity that forms the basis of the phenotypic diversity in maize and other economically important plants.
Online Content Methods, along with any additional Extended Data display items and Source Data, are available in the online version of the paper; references unique to these sections appear only in the online paper.

METhOdS
No statistical methods were used to predetermine sample size. The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment. Whole-genome sequencing using SMRT technology. DNA samples for SMRT sequencing were prepared using maize inbred line B73 from NCRPIS (PI550473), grown at University of Missouri. Seeds of this line were deposited at NCRPIS (tracking number PI677128). Etiolated seedlings were grown for 4-6 days in ProMix at 37 °C in darkness to minimize chloroplast DNA. Batches of ~ 10 g were snapfrozen in liquid nitrogen. DNA was extracted following the PacBio protocol 'Preparing Arabidopsis Genomic DNA for SizeSelected ~ 20 kb SMRTbell Libraries' (http://www.pacb.com/wpcontent/uploads/2015/09/SharedProtocol PreparingArabidopsisDNAfor20kbSMRTbellLibraries.pdf).
Genomic DNA was sheared to a size range of 15-40 kb using either Gtubes (Covaris) or a Megarupter device (Diagenode), and enzymatically repaired and converted into SMRTbell template libraries as recommended by Pacific Biosciences. In brief, hairpin adapters were ligated, after which the remaining damaged DNA fragments and those without adapters at both ends were elimi nated by digestion with exonucleases. The resulting SMRTbell templates were sizeselected by Blue Pippin electrophoresis (Sage Sciences) and templates rang ing from 15 to 50 kb, were sequenced on a PacBio RS II instrument using P6C4 sequencing chemistry. To acquire long reads, all data were collected as either 5 or 6h sequencing videos. Construction of optical genome maps using the Irys system. Highmolecular mass genomic DNA was isolated from 3 g of young ear tissue after fixing with 2% formaldehyde. Nuclei were purified and lysed in embedded agarose as previously described 31 . DNA was labelled at Nt.BspQI sites using the IrysPrep kit. Molecules collected from BioNano chips were de novo assembled as previously described 32 using 'optArgument_human' . De novo assembly of the genome sequencing data. De novo assembly of the long reads from SMRT Sequencing was performed using two assemblers: the Celera Assembler PBcR -MHAP pipeline 33 and Falcon 34 with different parameter settings. Quiver from SMRT Analysis v2.3.0 was used to polish base calling of contigs. The three independent assemblies were evaluated by aligning with the optical genome maps.
Contamination of contigs by bacterial and plasmid genomes was eliminated using the NCBI GenBank submission system 35 . Curation of the assembly, including resolution of conflicts between the contigs and the optical map and removal of redundancy at the edges of contigs, is described in the Supplementary Information. Hybrid scaffold construction. To create hybrid scaffolds, curated sequence con tigs and optical maps were aligned and merged with RefAligner 32 (P < 1 × 10 −11 ). These initial hybrid scaffolds were aligned again to the sequence contigs using a less stringent P value (1 × 10 −8 ), and those contigs not previously merged were added if they aligned over 50% of their length and without overlapping previously merged contigs, thereby generating final hybrid scaffolds. Pseudomolecule construction. Sequences from BACs on the physical map that were used to build the maize v3 pseudomolecules were aligned to contigs using MUMMER package 36 with the following parameter settings: 'l(minimum length of a single match) 100 c(the minimum length of a cluster of matches) 1000' . To only use unique hits as markers, alignment hits were filtered with the following parameters: 'i(the minimum alignment identity) 98 l(the minimum alignment length) 10000' . Scaffolds were then ordered and oriented into pseudochromo somes using the order of BACs as a guide. For quality control, we mapped the SNP markers from a genetic map built from an intermated maize recombinant inbred line population (Mo17 × B73) 10 . Contigs with markers not located in pseudochromosomes from the physical map were placed into the AGP (A Golden Path) using the genetic map. Further polishing of pseudomolecules. Raw pseudomolecules were subjected to gap filling using Pbjelly (maxTrim = 0, minReads = 2) and polished again using Quiver (SMRT Analysis v2.3.0). To increase the accuracy of the base calls, we performed two lanes of sequencing on the same genomic DNA sample (library size = 450 bp) using Illumina 2500 Rapid run, which generated about 100fold 2 × 250 pairedend (PE) data. Reads were aligned to the assembly using BWAmem 37 . Sequence error correction was performed with the Pilon pipeline 38 , after aligning reads with BWAmem 37 and parsing with SAMtools 39 , using sequence and alignment quality scores above 20. Annotation. For comprehensive annotation of transposable elements, we designed a structural identification pipeline incorporating several tools, including LTRharvest 40  The MAKERP pipeline was used to annotate proteincoding genes 46 , integrat ing ab initio prediction with publicly available evidence from fulllength cDNA 47 , de novo assembled transcripts from shortread mRNA sequencing (mRNAseq) 48 , isoformsequencing (IsoSeq) fulllength transcripts 14 , and proteins from other species. The gene models were filtered to remove transposons and lowconfidence predictions. Additional alternative transcript isoforms were obtained from the IsoSeq data. Further details on annotations, core promoter analysis, and comparative phylogenomics are described in Supplementary Information. Structural variation. Leaves were used to prepare high molecular mass DNA and optical genome maps were constructed as described above for B73. Structural variant calls were generated based on alignment to the reference map B73 v4 chromosomal assembly using the multiple local alignment algorithm (RefSplit) 32 . A structural variant was identified as an alignment outlier 32,49 , defined as two wellaligned regions separated by a poorly aligned region with a large size difference between the reference genome and the map or by one or more unaligned sites, or alternatively as a gap between two local alignments. A confidence score was generated by comparing the nonnormalized P values of the two wellaligned regions and the nonnormalized loglikelihood ratio 50 of the unaligned or poorly aligned region. With a confidence score threshold of 3, RefSplit is sensitive to insertions and deletions as small as 100 bp (events smaller than 1 kb are generally compound or substitution and include label changes, not just spacing differences) and other changes such as inversions and complex events which could be balanced. Insertion and deletion calls were based on an alignment outlier Pvalue threshold of 1 × 10 −4 . Insertions or deletions that crossed gaps in the B73 pseudomolecules, or that were heterozygous in the optical genome maps, were excluded. Considering the resolution of the BioNano optical map, only insertion and deletions larger than 100 bp were used for subsequent analyses. To obtain highconfidence dele tion sequences, sequencing reads from the maize HapMap2 project 8 for Ki11 and W22 were aligned to our new B73 v4 reference genome using Bowtie2 (ref. 51). Read depth (minimum mapping quality > 20) was calculated in 10kb windows with step size of 1 kb. Windows with read depth below 10 in Ki11 and 20 in W22 (sequencing depths for Ki11 and W22 were 2.32× and 4.04× , respectively) in the deleted region were retained for further analysis. Data availability. Raw reads, genome assembly sequences, and gene annotations have been deposited at the NCBI under BioProject number PRJNA10769 and BioSample number SAMN04296295. PacBio wholegenome sequencing data and Illumina data were deposited in the NCBI SRA database under accessions SRX1472849 and SRX1452310, respectively. The GenBank accession number of the genome assembly and annotation is LPUQ00000000. A genome browser including genome feature tracks and ftp is available from Gramene: http://ensembl.gramene. org/Zea_mays/Info/Index. All other data are available from the corresponding author upon reasonable request.