DACCOR - Detection, characterization, and reconstruction of repetitive regions in bacterial genomes

The reconstruction of genomes using mapping-based approaches with short reads experiences difficulties when resolving repetitive regions. These repetitive regions in genomes result in low mapping qualities of the respective reads, which in turn lead to many unresolved bases. Currently, the reconstruction of these regions is often based on modified references in which the repetitive regions are masked. However, for many references, such masked genomes are not available or are based on repetitive regions of other genomes. Our idea is to identify repetitive regions in the reference genome de novo . These regions can then be used to reconstruct them separately using short read sequencing data. Afterward, the reconstructed repetitive sequence can be inserted into the reconstructed genome. We present the program DACCOR, which performs these steps automatically. Our results show an increased base pair resolution of the repetitive regions in the reconstruction of Treponema pallidum samples, resulting in fewer unresolved bases. ABSTRACT 10 The reconstruction of genomes using mapping-based approaches with short reads experiences difﬁculties when resolving repetitive regions. These repetitive regions in genomes result in low mapping qualities of the respective reads, which in turn lead to many unresolved bases. Currently, the reconstruction of these regions is often based on modiﬁed references in which the repetitive regions are masked. However, for many references, such masked genomes are not available or are based on repetitive regions of other genomes. Our idea is to identify repetitive regions in the reference genome de novo . These regions can then be used to reconstruct them separately using short read sequencing data. Afterward, the reconstructed repetitive sequence can be inserted into the reconstructed genome. We present the program DACCOR , which performs these steps automatically. Our results show an increased base pair resolution of the repetitive regions in the reconstruction of Treponema pallidum samples, resulting in fewer unresolved bases.


22
The reconstruction of a genome from sequencing reads can be achieved using a de novo assembly, where 23 overlaps of the reads are identified and thus extended into longer continuous sequences called contigs. 24 Alternatively, if a closely related reference genome is available, the reads can be mapped against this 25 genome, where the reconstruction is based on the consensus of the reads for each base. 26 For the mapping-based approach, programs such as such as BWA (Li and Durbin, 2009) (Langmead and Salzberg, 2012) are used to align short reads generated by Next-Generation-Sequencing 28 (NGS) technologies to a known reference genome (Veeramah and Hammer, 2014). The consensus 29 sequence of the aligned reads can then be used to generate the genomic sequence of the newly sequenced 30 sample, assuming that the sample was sequenced with a sufficient coverage depth. This allows for the fast 31 identification of short insertions, deletions, and single nucleotide variations (SNVs) or single-nucleotide 32 polymorphism (SNP). The mapping programs typically calculate a score for each aligned read that 33 corresponds to the quality of the alignment (Li et al., 2008). The score quantifies the probability that a 34 read is placed at the correct genomic position. Reads with a low mapping quality can be filtered out to 35 remove reads that might stem from contaminations or were sequenced with low sequencing quality (Smith 36 et al., 2008). Besides bad quality reads, also reads mapping to repetitive regions could yield low-quality 37 scores if they cannot be mapped to a unique position. Filtering of reads with low mapping qualities would 38 also include these reads. This filtering is often conducted in the context of ancient DNA (aDNA) (Bos 39 et al., 2016), so that for such samples the repetitive regions of the respective reconstructed genomes 40 are generally affected, resulting in many unknown (N) characters. Also, de novo assembly approaches 41 have problems to reconstruct repetitive regions, at least those that are significantly longer than the read to be associated with outer membrane proteins, which suggests that they help pathogens to adapt to 48 their hosts (Denoeud and Vergnaud, 2004). In the case of the bacterium Treponema pallidum, repetitive 49 sequences in the arp gene are used to distinguish between the subspecies that cause veneral syphilis 50 (Treponema pallidum pallidum), nonveneral yaws (Treponema pallidum pertenue), and bejel (Treponema 51 pallidum endemicum), which is not possible using serological tests (Harper et al., 2008). 52 New sequencing technologies, like the Illumina SLR platform, PacBio, or Oxford Nanopore, can 53 create long reads that can span most repetitive regions to resolve them (Huddleston et al., 2014). However, 54 it is not always possible to apply these technologies to DNA samples. For  which is based on the sequencing reads and not on the respective reference genome. So far, no tool exists 73 that allows users to combine repeat identification and genome reconstruction from NGS data.

74
The general idea of our approach first starts with a de novo identification of all repetitive regions 75 in a given reference genome. Reads from NGS data are then mapped against the reference genome

82
DACCOR consists of three major steps (see central column of Figure 1). It first conducts a de novo 83 identification of all repetitive regions for a given reference genome. For each identified repetitive region 84 and each NGS data sample, the respective sequence is then reconstructed using the mapping-based 85 pipeline EAGER. For each sample, EAGER is also used to reconstruct a full genome. Finally, all repetitive 86 regions in the reconstructed genome are replaced by the individually reconstructed repetitive sequences.

87
To create an integrated version to identify repetitive regions, our de novo approach uses a k-mer based   First, all repetitive regions in a given reference genome are identified. DACCOR comes with its own de novo repeat identification tool, which is separated into six steps. However, DACCOR is not limited to this approach and it can be exchanged with any other repeat identification tool. Afterward, each repetitive region is reconstructed separately using EAGER. Finally, the individually reconstructed regions are combined with the reconstruction of the full genome by replacing the corresponding base pairs in the reconstructed genome with the ones of the reconstructed repetitive regions. This leads to an increased base pair resolution of these regions.
regions. This representation allows us to combine two previously identified repetitive regions that are 100 separated by these mismatches. If they only differ by one base after a mismatch marker is added to all 101 sequences, they are now represented as the same sequence and can be combined. This can be repeated to 102 allow for multiple mismatches. Matching pairs of separated regions can thus be combined into repetitive 103 regions with mismatches. In the last filtering step, leading and trailing mismatch markers, as well as 104 repetitive regions that do not fulfill the user-defined criteria (e.g., length of the repetitive regions), are 105 removed. All remaining mismatch markers are replaced with the character N (see supplementary Figure 1 106 for an example). However, this approach can only handle base pair mismatches but no insertions or 107 deletions. If indels exist within the repetitive regions, they will be identified separately and are still used 108 in the following steps to reconstruct them.

109
The methodology for identifying repetitive regions, as described above, is implemented in the   The reconstruct subprogram of DACCOR automatically generates EAGER configuration files for 138 a given reference, its identified repetitive regions, and multiple sequencing samples.

139
The combine subprogram uses the EAGER output of the individually reconstructed repetitive regions, To be able to automatically assemble a genome with all its repetitive regions, the subprogram 145 pipeline connects the described subprograms identify, reconstruct, and combine. plasmids (see Table 1 for RefSeq IDs). We first compared the step of the identification of repetitive  Table 1 for SRA project 154 IDs) to reconstruct the sequences of the 16S and 23S rRNA, which are duplicated in the bacterium T. Manuscript to be reviewed where contigs mapped were counted as resolved bases and the ones where no contig mapped as unresolved.

171
They were then compared with the results of EAGER without repeat resolution and DACCOR with repeat 172 resolution.

173
If not specified differently, all programs were run with default parameters.

175
We first evaluated the identification of repetitive regions by comparing these to the ones identified by 176 VMatch. This comparison (see Table 2) shows that the results of both programs are almost identical. We 177 considered VMatch as the "golden truth" and could, therefore, compute an accuracy for the repetitive 178 regions reported by DACCOR. We achieved an excellent accuracy with over 99% in all tested cases with 179 only very few false negatives as well as false positives.

180
Next, we reran DACCOR with a higher sensitivity allowing for up to five mismatches in repetitive 181 regions of lengths at least 101 base pairs. These results of the different bacterial genomes (see Table 3) repetitive regions respectively, and only 2.0% of the T. pallidum genome is repetitive.

194
Next we assessed the runtime of DACCOR, which was calculated on a server with four Intel(R) Manuscript to be reviewed

221
Using the individual reconstructions of the 16S rRNA and the 23S rRNA gene sequences, we also 222 tried to identify SNVs or SNPs that are specific for only one copy of the respective gene. Hereafter we 223 refer to these variations using the term SNP, though we cannot confirm that the variation is present to 224 a minimum degree within a population (for example ≥ 1%. The results of this analysis identified 36 225 positions that have a SNP call in at least one of the 106 samples for the 16S rRNA (see Table 4). Of those  Table 5. Statistics for the assembly with SPAdes of the two clinical syphilis samples (AR1 and AR2 from Arora et al. (2016)). The sample column specifies the postprocessing step: "filtered" refers to contigs of 1,000 bp or more and "mapped" are only those contigs that could be mapped against the reference genome of T. pallidum.   Manuscript to be reviewed The comparison between DACCOR and VMatch showed that VMatch is slightly more sensitive,

268
probably due to its suffix array approach. A possibility to increase the sensitivity of DACCOR would be 269 to elongate all identified repetitive regions based on a local alignment. In principle, DACCOR could also 270 use Vmatch, as well as other repeat finding methods, to replace our de novo repeat finder. However, to 271 automate this, further adaptions would be necessary.

272
When mapping short reads from typical NGS data, a number of approaches recommend to use together with a margin at the 5' and 3' end of the region. We propose to set the margin to the length of the 290 longest read that will be used in the mapping.

291
Using DACCOR, we have shown that a higher base pair resolution in repetitive regions, compared to 292 the reconstruction using the standard mapping-based approach, can be achieved. In the standard approach, 293 reads that can be mapped to different locations result in a mapping quality of zero, which in turn decreases 294 the genotyping quality (McKenna et al., 2010). Additionally, our results show that de novo assembly also 295 cannot reconstruct all repetitive regions completely, even though the results are better than the standard 296 mapping-based approach without repeat resolution. By mapping to the repetitive region only, these reads 297 have a higher mapping quality, a higher genotype quality, and thus result in more resolved positions.

298
However, one has to acknowledge that other unresolved bases that may stem from a lack of coverage or 299 low sequencing quality, cannot be resolved with our approach. In the case of the two syphilis samples, we  Manuscript to be reviewed In conclusion, we have developed an entirely automatic pipeline that first conducts a de novo repeat 306 identification in bacterial genomes and then uses the repetitive regions for an enhanced mapping of short 307 read NGS data. Increasing the resolution of a draft genome affects many downstream analyses, such as 308 population genetics or phylogenetic analyses. For future improvements, we plan to reduce the runtime 309 and memory usage by adjusting our data structure and by adding more parallelization to some of the 310 computing steps. With this, we hope to eventually be able to identify repetitive regions also in large 311 eukaryotic genomes, like the human genome. 313 We have developed an automated software pipeline, written in Java, which allows other researchers to use 314 our methodology. This pipeline is available on github: