Population Level Variation of Transposable Elements in a Maize Diversity Panel

Intact transposable elements (TEs) account for 65% of the maize genome and can impact gene function and regulation. Although TEs comprise the majority of the maize genome and affect important phenotypes, genome wide patterns of TE polymorphisms in maize have only been studied in a handful of maize genotypes, due in part to the challenging nature of assessing highly repetitive sequences. We implemented a method to use short read sequencing data from 509 diverse inbred lines to classify the presence/absence of 494,564 non-redundant TEs that were previously annotated in four genome assemblies including B73, Mo17, PH207, and W22. Different orders of TEs (i.e. LTRs, Helitrons, TIRs) had different frequency distributions within the population. Across the different orders, TE family size was negatively correlated with average population frequency of TEs in the family and nested TEs are at lower frequency than non-nested TEs. Age of LTR elements was positively correlated with population frequency. Comparison with SNP data revealed that while a majority of TEs are tagged by nearby SNPs (r2 > 0.9) there are also many TEs in low to moderate linkage disequilibrium with nearby SNPs. This study provides a population scale genome-wide assessment of TE variation in maize, and provides valuable insight on variation in TEs in maize and factors that contribute to this variation.


INTRODUCTION
Transposable elements (TEs) are present in all eukaryotic genomes (BENNETZEN 2000;WICKER et al. 2007). In maize, 65% of the genome is made up of intact TEs (JIAO et al. 2017), and another 20% is comprised of fragmented TEs (SCHNABLE et al. 2009). There are many examples of phenotypic effects of TEs from null mutations, such as maize kernel color (SELINGER AND CHANDLER 2001), white wine grapes (CADLE-DAVIDSON AND OWENS 2008), and color variation in the common morning glory (CLEGG AND DURBIN 2000). TE insertions can also affect positive and negative gene regulatory functions, such as insertion of a long terminal repeat LTR retrotransposon into the promoter region of the Ruby gene in oranges that leads to red fruit flesh of blood oranges (BUTELLI et al. 2012), and an LTR retrotransposon that is associated with red skin color in apples (ZHANG et al. 2019). TE insertions in maize have also been associated with genes that are upregulated in response to abiotic stress (MAKAREVITCH et al. 2015).
TEs are classified into two classes, depending on how they replicate, and from there into superfamilies and families, by sequence similarity (WICKER et al. 2007). Class I elements, or retrotransposons, replicate via an RNA intermediate (BENNETZEN 2000;LISCH 2013). Long terminal repeat retrotransposons (LTR) are the most abundant type of retrotransposons in maize (BENNETZEN 2000) and account for over half of the maize genome by sequence length STITZER et al. 2019). Class II elements, or DNA transposable elements, replicate via a DNA intermediate, and the two largest orders are Helitrons and terminal inverted repeat (TIR) elements. TIRs are defined by terminal inverted repeat sequences at both ends of the TE (WICKER et al. 2007) and intact TIRs make up around 3% of the maize genome .
Helitrons are defined by their 'rolling circle' replication mechanism (LISCH 2013) and intact Helitrons make up around 4% of the maize genome . TEs are found throughout the maize genome, including near and in genes, and can be anywhere from a few hundred base pairs to >10 kb in length (BENNETZEN 2000), and range in age, from very recent insertions to >2 million years old insertions .
Studies done on TEs in maize have shown that there is extensive variation in TE insertion presence/absence patterns at specific loci across maize inbreed lines (SANMIGUEL et al. 1996;FU AND DOONER 2002;MORGANTE et al. 2005;DOONER et al. 2019). Early work on the bronze locus in multiple maize lines found that different lines differed in not only the gene order and content but also in TE content (FU AND DOONER 2002). More recent work on mutations in the same region found that not only were high mutation rates due to TE insertions, but that different TEs were inserting in different maize lines (DOONER et al. 2019). Whole genome analysis of four maize genomes with de novo TE annotations revealed extensive TE polymorphism between maize lines on a whole genome scale . On average, about 500 Mb of TE sequence, or ~20% of the maize genome, was variable between the four inbred lines included in the study: B73, Mo17, PH207 and W22. Another 1.6 Gb of TE sequence was only shared between two or three of the lines.
Despite the widespread prevalence of TEs in the maize genome, connection of TE insertions to functional phenotypic variation, and evidence of widespread TE polymorphism between maize genotypes, there have been very few scans of specific TE insertion frequencies in divergent maize populations. The analysis of the frequency of TE insertions can provide insights into the level of variability for TEs and help understand the presence of common and rare TE variants. To understand patterns of TE polymorphism on a genome-wide scale, we utilized short 5 read sequencing of 509 diverse maize lines to score TE presence/absence of 494,564 nonredundant TEs that were annotated in four reference genome assemblies. This study provides a genome-wide analysis of TE polymorphism frequencies across a large number of genotypes as we continue to try to understand how TEs contribute to phenotypic variation and adaptation within species.

MATERIALS AND METHODS
Whole genome resequencing: A subset of 509 lines from the Wisconsin Diversity Panel (HANSEY et al. 2010;HIRSCH et al. 2014;MAZAHERI et al. 2019) was used for this study (Table S1). For 57 genotypes, available short-read sequence data was downloaded from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA ; Table S1). These samples ranged in theoretical coverage of 10-55x sequencing depth. For 454 genotypes, plants were grown under greenhouse conditions (27C/24C day/night and 16 h light/8 h dark) with five plants of a single genotype per pot for DNA extractions. Plants were grown in Metro-Mix 300 (Sun Gro Horticulture) with no additional fertilizer. Tissue from the newest leaf of each seedling in the pot was collected and immediately flash frozen in liquid nitrogen at the Vegetative 2 developmental stage. DNA was extracted using a standard cetyltrimethyl ammonium bromide (CTAB) DNA extraction protocol (SAGHAI-MAROOF et al. 1984), and treated with 25 uL of RNase at 39℃ for 30 minutes. Genomic DNA for each genotype was submitted to Novogene (Novogene Co., Ltd., Beijing, China) for whole genome sequencing with 150 base pair paired end reads generated on a HiSeq X Ten sequencing machine. For each genotype at least 20x theoretical sequencing depth was achieved. After assessing accuracy of different TE calling parameters the following criteria were chosen to balance accuracy and the number of ambiguous calls: 1) A TE was called present if the mean coverage over the 10 bp just after the start and just before the end of the TE were both greater than or equal to 8x, 2) A TE was called absent if the mean coverage over the 10 bp just after the start and just before the end of the TE were both less than 7x, and 3) all other TEs were called as ambiguous that did not meet either of these criterion. TEs for which the alignment of short reads to its own genome assembly did not result in a present call were removed from downstream analyses. homozygotes or 50 heterozygotes) and to filter out genotypes called in fewer than 90% of the individuals in the population. In total, 3,146,253 SNPs remained after all of these filtering steps.
These SNP calls were used to conduct linkage disequilibrium analysis with the TE presence/absence scores described above. For this analysis, TE presence/absence data from only the alignments to B73 reference genome assembly were used. Any SNPs located within TEs were removed. Linkage disequilibrium between SNPs and TEs was calculated using plink v1.90b6.16 (PURCELL et al. 2007) with the --make-founders option to calculate LD among all inbred lines, -allow-extra-chr to calculate LD in extra scaffolds, --ld-window-r2 0 to report r2 in the 0 to 1 range (default is 0.2 to 1), and --ld-window 1000000 --ld-window-kb 1000 to calculate LD within 1mb windows. The SNP in the highest LD with a TE within the 1 Mb window was used for downstream visualization. Principle components analysis of SNPs and TEs were conducted using plink v1.90b6.16 (PURCELL et al. 2007).

Data Availability:
All sequence data is available on the NCBI SRA (BioProject PRJNA661271, Table   S1). Code for this study is available at https://github.com/HirschLabUMN/TE_variation.

RESULTS AND DISCUSSION
Using short read sequence data to identify TE presence/absence: For this study, we implemented an approach for scoring TE presence/absence variation from short read sequence alignment using the average coverage of windows surrounding the boundaries of previously annotated TEs. Iterations of the approach were evaluated for accuracy using false positive, true positive, false negative, true negative rates. For the accuracy assessment, presence/absence scores defined by previous comparison of TE content generated for four maize genome assembles including B73, Mo17, PH207, and W22 were used as truth . It should be noted that some polymorphisms in this 'truth' set might be wrong for reasons described in the publication from which they were generated. As such, it will not be possible to get 100% accuracy in comparison to this set unless the same miscalls are generated in two independent methods. Still, this set represents a high quality set of TE polymorphisms for which to assess the relative accuracy of different parameters. The following parameters were factorially tested: 1) mean coverage of the windows used to classify a TE as present (2x-8x coverage), 2) size of the windows around the start and end of the TE (1, 3, 5, and 10 bp), and 3) coverage before and after the start and end of each TE (4 sides) versus only coverage just after the start and just before the end of the TE (2 sides) ( Figure S1).
Initially, a TE was only classified as present or absent. To be classified as present the mean coverage over all genome fragments used had to be greater than or equal to the coverage threshold. If mean coverage was less than the threshold over any genome fragment at the start or end of the TE in question, the TE was classified as absent. From this initial factorial analysis, coverage threshold was identified as an important factor for TE calling accuracy ( Figure 1). In contrast, window size did not have a major impact on TE calling accuracy, nor did using coverage information from only just after the start and just before the end of a TE versus before and after the start and end of the TE boundaries ( Figure 1). It is notable that the TE classification accuracy rate patterns differed in W22, compared to Mo17 and PH207, where the mean genome-wide coverage was substantially lower (6x average mean coverage for W22) versus 28x and 30x for Mo17 and PH207).
Further investigation showed that ~11% of the TEs had inconsistent coverage information, where coverage at one side of the TE was above the coverage threshold but the other side was below the threshold ( Figure S2). In the original method, these TEs were called as absent, but 42% in PH207 and 50% in Mo17 are actually present based on the prior genomewide comparisons. This is a much higher error rate than the overall error rate of ~10%. Based on this observation, a third possible TE classification category of ambiguous was added. With this new category, the criteria for classifying a TE as present did not change (i.e. all windows included in the analysis had to be above the given coverage threshold), but the criteria for classifying a TE as absent was changed to require that the mean coverage over all genome fragments of the TE had to be less than the coverage threshold minus one. All other TEs were classified as ambiguous ( Figure S2).
Within our panel of sequenced genotypes, we had a range of observed genome wide mean coverages ( Figure S3). To test the effects of coverage on TE classification accuracy rates the Mo17 and PH207 reads were down sampled to 20%, 32%, 45%, 66% and 83% of their original coverage levels. This corresponded to genome-wide mean coverages of ~6x, ~10x, ~15x, ~20x, and ~26x. Reducing mean coverage from 30x to 15x had a neglectable effect on TE classification accuracy at a given coverage limit threshold ( Figure S4), but reducing genome wide mean coverage from 30x to 6x had a noticeable effect on TE classification accuracy ( Figure S4). As long as mean coverage was greater than 10x, classification rate accuracy was not affected by sequencing coverage. It is worth noting that the accuracy rates of Mo17 and PH207 down sampled to 6x genome wide coverage were similar to those of W22, which only had 6x mean coverage, which indicates that the down sampling accurately mimicked patterns of coverage for genotypes that have low mean coverage.
From these assessments of accuracy across different parameters a final criterion was identified to balance inaccurate classifications and the rate of ambiguous classifications. This final criterion required the average coverage of the two 10 bp internal flanking windows to be greater than or equal to 8x to be classified as present, and both sides less than 7x average coverage to be classified as absent. All other TEs were classified as ambiguous. Highly polymorphic flanking regions in some haplotypes will result in a higher ambiguous classification rate for particular TEs that could impact downstream data analysis of TE polymorphisms. Thus, TEs with great than 25% ambiguous calls were filtered from further analyses. These criterion and filtering methods provide a set of TE presence/absence calls that allows for population level analysis of TE frequencies on a genome-wide scale in maize. Still, it should be noted, that rare alleles will be under-represented in this set due to the fact that only those TEs that were previously annotated in at least one of four de novo assembled maize genomes are included in this study. These results suggest that TEs present in small TE families are much more likely to exhibit variability in maize populations but it is worth noting that individual members of large families are often polymorphic too. frequency, which suggests that older LTRs were generally more frequent in the population than younger LTRs (Figure 3). It is worth noting, however, that there are a large number (n=23,506) of LTRs that are very young (LTR similarity above 95), and are at high frequency (>80%) in the population. These TEs are potentially under selection or linked to other features in the genome under selection that could have driven their rapid rise in frequency in the population.

Age of LTRs is positively correlated with population frequency:
Nested TEs are less frequent and younger than non-nested TEs: Nested TEs exhibited higher levels of moderate frequency (20-80%) and low frequency (<20%) TEs as compared to the nonnested TEs (Figure 2; Figure S8). In all three orders, the frequency distribution of nested and nonnested TEs was significantly different (KS test, p-value < 2x10-16). This trend towards lower frequency TEs in the nested elements is likely a reflection of the younger age of these elements that have inserted into pre-existing TEs ( Figure S9), and therefore do not follow the same distribution as the non-nested TEs. We further explored this difference in frequencies between nested and non-nested elements by plotting the frequency of each nested TE versus the frequency of the TE in which it is nested (Figure 4a). Overall, there was a strong correlation between the nested TEs and the TE into which it was nested (r 2 = 0.578; Figure 4a), which is expected. There was a subset of nested-outer elements that did not fall on the diagonal. Within this subset there was a much great portion that had a higher frequency in the outer element than in the nested element (Figure 4b).This is expected as the outer element must first be present before the nested element is able to insert as a nested element.
Within the set of outer elements that were fixed or nearly fixed in the population (frequency greater than 0.95), there was a continuous range of frequencies for inner elements that likely represent nested insertions with a range of ages. Using LTR similarity as a proxy to LTR age, we tested if there is a relationship between the age of the nested element contained within fixed elements and their frequency in the population. Indeed, those nested elements that were at lower frequency in the population were younger (higher LTR similarity) than those that were at high frequency in the population (Figure 4c). There were also outer elements in a range of frequencies that had extremely rare nested TEs that are likely very young or deleterious nested insertions.

Many TE insertional polymorphisms are not tagged by SNP markers:
There is a limited number of plant species in which a species or population level cataloging of TE presence/absence variation has been conducted at a genome-wide level (QUADRANA et al. 2016;STUART et al. 2016;DOMINGUEZ et al. 2020). As such, the contribution of TEs to phenotypic variation, utility for genomic prediction, and other applications linking genotypes and phenotypes has not been directly evaluated. In many instances, a linked marker has been identified as significant and with fine-mapping it is revealed that the causative variant is in fact a polymorphic TE insertion (e.g. (STUDER et al. 2011)). We sought to test the extent to which the extensive number of polymorphic TE insertions that were classified in this study were in linkage disequilibrium (LD) with genomewide SNP markers to begin to understand the extent to which phenotypic variation that is caused by TEs has or has not been accounted for in previous studies that used only SNP markers. For this analysis, all SNPs within 1Mb plus or minus a TE were evaluated and the SNP that was in the highest LD was recorded. The majority of TEs were in moderate (r 2 0.5-0.9) to high (r 2 >0.9) LD with a nearby SNP marker, but there was a subset that were in low LD (r 2 < 0.5) with all SNPs within a 1Mb window of the TE (Figure 5a). LTRs had the highest portion of TEs in high LD with a SNP (~73%), while Helitrons had ~61% in high LD and TIRs had only ~54% in high LD (Figure 5b).

18
Of the subset of TEs that had a SNP in high LD (>0.9), the majority of the SNPs were within 200Kb of the TE (Figure 5c). This distance increased when TEs in all levels of LD with SNPs were included ( Figure S10). TEs that were at very high or low frequency within the population were generally in low LD with SNPs, and those that were at moderate frequency in the population were in moderate to high LD with SNPs (Figure 5d-f). This trend was consistently observed across orders of TEs (Figure 5d-f). This analysis demonstrates that while TE presence/absence patterns generally reflect maize breeding history ( Figure S6), there are TE insertional polymorphisms that are not tagged by SNPs. This lack of LD is generally related to their frequency in the population, and rare TEs variants, that may be of important phenotypic consequence are the ones that are not being tagged. Including TE insertional polymorphism will likely be important in understanding the full complexity of phenotypic trait variation and location adaptation, and developing improved maize varieties in the future.         Figure 4. Relationship between population frequency of nested elements and the elements in which they are nested. a) Proportion of genotypes a TE is present in between nested TEs and the TE in which the element is nested. Only low ambiguous TEs were included in this plot. Only TEs from the B73 genome are plotted. b) Distribution of the proportion of genotypes the outer TE is present in minus proportion of genotypes the nested TE is present in. c) Age distributions for nested elements that are nested in TEs that are fixed or nearly fixed (frequency >0.95) in the population. LTR similarity is used as a proxy to age which higher LTR similarity corresponds to a younger TE and lower LTR similarity corresponds to an older TE. This plot only contains nested LTRs as age estimates are not available for other orders.   Figure S3. Realized genome-wide mean coverage for samples used in this study. Figure S4. TE calling accuracy for Mo17 and PH207 down sampled to different coverage levels. Coverage over the two internal flanking windows were used to call a TE as Present, Absent, or Ambiguous. Ambiguous calls were exclude from TE calling rates. The majority of samples used in this study have realized coverage between 10-30x. Figure S5. TE frequency distribution in a panel of diverse maize inbred lines. Genotypes include 508 individuals for the Wisconsin Diversity Panel (Hansey et al., 2010). Frequencies are for TEs annotated in the B73 v4 reference genome assembly .  Proportion of genotypes a TE is called as Present in Figure S6. Principle component analysis (PCA) of SNPs (left) and TE presence/absence (right) in a set of diverse maize inbred lines. TEs and SNPs were called relative to the B73 reference genome assembly. Figure S7. Proportion of genotypes a TE was called as 'Present' in between homologous TEs in B73 and Mo17. Only low ambiguous TEs included. Figure S8. Proportion of low, medium, and high frequency TEs in a panel of diverse inbred lines. Genotypes include 508 individuals for the Wisconsin Diversity Panel (Hansey et al., 2010). Frequencies are for the set of non-redundant TEs across four genome assemblies  Distance (Kb) Figure S10. Distance between TEs and the SNP with the highest LD to it. Distance is calculated as the middle of the TE to the SNP. Only SNPs within 1Mb of a TE were evaluated. Figure 1. TE calling accuracy rates. a) TE calling accuracy rates using 4 windows flanking the inside and outside of the TE boundaries to call a TE as present or absent. b) TE call accuracy rates using the two internal windows (immediately after the start and immediately before the end) of the TE. All TEs were called as either present or absent in this analysis (i.e. no ambiguous call category was implemented).   Figure 4. Relationship between population frequency of nested elements and the elements in which they are nested. a) Proportion of genotypes a TE is present in between nested TEs and the TE in which the element is nested. Only low ambiguous TEs were included in this plot. Only TEs from the B73 genome are plotted. b) Distribution of the proportion of genotypes the outer TE is present in minus proportion of genotypes the nested TE is present in. c) Age distributions for nested elements that are nested in TEs that are fixed or nearly fixed (frequency >0.95) in the population. LTR similarity is used as a proxy to age which higher LTR similarity corresponds to a younger TE and lower LTR similarity corresponds to an older TE. This plot only contains nested LTRs as age estimates are not available for other orders.  Figure S1. Schematic of factors tested for accuracy of TE calling using overage of short read sequence alignments in windows surrounding annotated transposable elements. a) variable coverage within windows to call a TE as present. b) variable window sizes. c) internal and external flanking windows versus only internal windows.  Figure S3. Realized genome-wide mean coverage for samples used in this study. Figure S4. TE calling accuracy for Mo17 and PH207 down sampled to different coverage levels. Coverage over the two internal flanking windows were used to call a TE as Present, Absent, or Ambiguous. Ambiguous calls were exclude from TE calling rates. The majority of samples used in this study have realized coverage between 10-30x. Figure S5. TE frequency distribution in a panel of diverse maize inbred lines. Genotypes include 508 individuals for the Wisconsin Diversity Panel (Hansey et al., 2010). Frequencies are for TEs annotated in the B73 v4 reference genome assembly .  Proportion of genotypes a TE is called as Present in Figure S6. Principle component analysis (PCA) of SNPs (left) and TE presence/absence (right) in a set of diverse maize inbred lines. TEs and SNPs were called relative to the B73 reference genome assembly. Figure S7. Proportion of genotypes a TE was called as 'Present' in between homologous TEs in B73 and Mo17. Only low ambiguous TEs included. Figure S8. Proportion of low, medium, and high frequency TEs in a panel of diverse inbred lines. Genotypes include 508 individuals for the Wisconsin Diversity Panel (Hansey et al., 2010). Frequencies are for the set of non-redundant TEs across four genome assemblies   Distance (Kb) Figure S10. Distance between TEs and the SNP with the highest LD to it. Distance is calculated as the middle of the TE to the SNP. Only SNPs within 1Mb of a TE were evaluated.