Harmonization of whole genome sequencing for outbreak surveillance of Enterobacteriaceae and Enterococci

Whole genome sequencing (WGS), is becoming the facto standard for bacterial typing and outbreak surveillance of resistant bacterial pathogens. We performed a three-center ring trial to assert if inter-laboratory harmonization of WGS is achievable, for this goal. To this end, a set of 30 bacterial isolates comprising of various species belonging to the Enterobacteriaceae and Enterococcus genera were selected and sequenced using the same protocol on the Illumina MiSeq platform in each individual centre. All generated sequencing data was analysed by 1 centre using BioNumerics (6.7.3) for i) genotyping origin of replications & antimicrobial resistance genes, ii) core-genome (cgMLST) for E. coli and K. pneumoniae & whole-genome multi locus sequencing typing (wgMLST) for all species. Additionally, a split k-mer analysis was performed to determine the number of SNPs between samples. A precision of 99.0% and an accuracy of 99.2% was achieved for genotyping. Based on cgMLST, only in 2/27 and 3/15 comparisons a discrepant allele was called between two genomes, for E. coli and K. pneumonia, respectively. Based on wgMLST, the number of discrepant alleles ranged from 0 to 7 (average 1.6). For SNPs, this ranged from 0-11 SNPs (average 3.4). Furthermore, we demonstrate that using different de novo assemblers to analyse the same dataset introduces up to 150 SNPs, which surpasses most thresholds for bacterial outbreaks. This shows the importance of harmonisation of data processing surveillance of bacterial outbreaks. Summarizing, multi-center WGS for bacterial surveillance is achievable, but only if protocols are harmonized.

were from a previously acquired collection, stored at Antwerp University. The strains were collected 175 from perianal swabs of hospitalized patients (21) and clients in nursing homes (6), and from feces 176 from broilers (2) and weaned pigs (1) by selective culturing. The culturing methods were described 177 elsewhere 21,23 . All strains originated from The Netherlands and Belgium. An overview of strains and 178 their origin is available in table 1. Strains were inoculated from -80°C on Mueller Hinton II agar (BD,179 Franklin Lakes, NJ, USA) and sent to the participating institutes. The 30 strains were divided in three 180 sets of ten strains. Each set was sequenced once by each center with a six-month interval between 181 each set. 182

DNA isolation and Whole genome sequencing 183
The DNA isolation procedure and WGS procedure was performed as follows: DNA was extracted 184 using the MasterPure DNA isolation kit (Lucigen) or MasterPure Gram Positive DNA purification kit 185 (Lucigen). Sequencing libraries were prepared using NexteraXT (Illumina), freely choosing index 186 primers. Libraries were sequenced on the Illumina MiSeq platform in paired end 2x250 base pairs 187 (bp) reads using MiSeq V2 cartridge. Preferably, each set of strains was subjected to WGS in a single 188 run. Acceptance criteria for WGS were a de novo assembly with an average coverage higher than 30 189 and less than 1000 contigs, as reported in BioNumerics (7.6.3). Samples not fulfilling acceptance 190 criteria were re-sequenced . The accession numbers for the raw sequencing data are available in table  191 1. Generated data was shared among institutes and analysis of all generated datasets (n=90) was 192 performed in one institute. 193

Statistical analysis 194
Statistical analyses were done using scipy.stats module (V1.3.1) 24 and the statsmodel.api package in 195 Python (v3.7). 196 cgMLST and wgMLST allele calling and genotyping 197 Raw sequencing reads were assembled using a custom pipeline in BioNumerics (7.6.3) employing 198 SPAdes 25 (v3.7.0) for its de novo assembly. From the raw reads and the de novo assembly, alleles 199 were called for core genome and whole genome MLST (cgMLST/wgMLST). In BioNumerics, cgMLST 200 schemes were only available for E. coli and K. pneumoniae consisting of 2513 and 634 fixed loci, 201 respectively. Pairwise allelic distance was determined by counting the number of discrepant allele 202 variants between two datasets, ignoring loci that were not present in both datasets. Resistance 203 genes and Origins of replication (ORI) were determined using BLAST 26 and two custom databases 204 based on Resfinder 27 and PlasmidFinder 28 . AMR genes were called with a using 90% identity and 60% 205 length cutoff. ORIs were called using 95% identity and 60% length cutoff. In total, 90 WGS datasets 206 were generated. As no gold standard with regard to true genotype of each strain was available, the 207 following rules were applied: (i) If either two or three out of three datasets of a strain had a specific 208 genotype, this was considered as a true positive observation; (ii) If only one out of three datasets of a 209 strain had a specific genotype, this was considered as a false positive observation; (iii) If a different 210 allelic variant was observed (i.e two blaTEM-1B and one blaTEM-116) this was noted as a discrepancy 211 and counted as a false positive. 212

wgSNP analysis 213
To determine the best de novo assembler to use for wgSNP analysis, the assembler generating the 214 least amount of pairwise SNPs (using SKA), among assemblies of the same strain, was chosen. To 215 avoid complexity, only the E. coli dataset of this study was used. The following assemblers were used: 216 I) SPAdes (v3.14.0) 25 , II) SKESA (parameters: '-use-paired_end', v2.3.0) 29 , III) Megahit 30 (v1.2.9). All 217 tools were used in default settings, unless specified otherwise. Additionally, the assembly free 218 method to determine SNPs straight from the raw reads, using "SKA fastq", was also used in this 219 comparison. The complete workflow is available at "https://github.com/MUMC-220 MEDMIC/assemblercompare" (v1.0). SKA 13 was used to determine SNPs on a whole genome level, 221 using a split k-mer length of 31. In short, pairwise SNPs were determined by generating a profile of 222 split k-mers, in which the middle base may vary ("SKA fasta" for assembly-or "SKA fastq" for read 223 based SNP profiling). The number of SNPs, between two datasets, was determined by comparing the 224 split k-mer files ("SKA distance"). All data preprocessing for the SNP based data analysis was 225 performed using Snakemake 31 as workflow manager. 226

Data availability 227
All raw sequencing data was deposited at EBI-ENA under BioProject PRJEB40571. 228

230
The assembly coverage, or the depth of coverage, of all strains ranged from 30 to 203 (figure 1A). The 231 N50 score, indicative for how fragmented a de novo assembly is, ranged from 33.712bp (E. faecium) 232 to 942.715 bp (K. pneumoniae) and showed clear species dependence (FIG 1.B). Assemblies of E. coli, 233 E. faecalis and E. faecium showed a lower N50 score, indicating the difficulties of assembling such 234 genomes ( Figure 1B). The number of contigs also varied per species, and overall had a significant 235 negative correlation with the sequencing depth (P < 0.01, Spearman rank correlation, FIG1.F ).

12
The number of wgMLST alleles called ranged from 1933 (Citrobacter sp.) alleles to 5493 (K. 237 pneumoniae, Figure 1C). Furthermore, the average number of alleles per kilobases (kb) ranged from 238 0.41 to 0.98. A significant positive trend for the normalized allele count and sequencing depth was 239 observed (P < 0.05, Spearman rank correlation FIG1.G). Surprisingly, mainly the Citrobacter datasets 240 seemed to showed a low coding density (range 0.41 to 0.65) compared to the median of the entire 241 dataset (0.83). Further inspection of the Citrobacter genome assemblies using BLAST webservice 242 (https://blast.ncbi.nlm.nih.gov/Blast.cgi, accessed 1-4-2020), showed low homology (~85% DNA 243 identity score) to known Citrobacter strains available in the NCBI database (accessed on 1-4-2020, 244 data not shown). 245 One dataset of E. faecium-1 had an unusual large genome size of 5.4Mb ( Figure 1E). This strain also 246 had a higher number of contigs; (636, median of 274 for E. faecium, Figure 1F), and showed a lower 247 number of alleles per kb (0.43, median of 0.84. Figure 1D) compared to the other E. faecium 248 datasets. This indicates contamination in the NGS dataset of a non-E. faecium microbe. Manual 249 inspection of the assembly, using BLAST webservice, showed the presence of contigs belonging to 250 Cutibacterium (formerly known as Propionibacterium), a skin commensal and previously described as 251 a common contaminant of NGS datasets 32-34 . 252

Resistance genes and plasmid ORIs 253
Overall, a good consensus was obtained for the genotyping of plasmid ORIs and AMR genes ( Figure  254 2A and 2B). A total of 973 AMR genes and ORIs were called with a precision of 99.0% and sensitivity 255 of 99.2%. For four strains, a genotype was not called in one of the datasets. The missed genotypes 256 were for E. cloacae-2 a sul1 gene, for E. coli-6 a tet(A) gene, for E. faecalis-1 an aac(6')-aph(2") gene, 257 and for E. faecium-2 an aph(2'')-Ia gene. For Citrobacter-2 and K. oxytoca-2, a false discovery of a 258 blaTEM-116 was observed, as this genotype was not called in either of the other two datasets of 259 these strains. For four strains, a discrepant genotype was called. These discrepancies were observed 260 for K. aerogenes-2 (blaTEM), for E. cloacae-2 (aadA), and for K. oxytoca-1 (blaOXY and blaTEM). 261 13 Twice, an unexpected ColpVC was found in a K. oxytoca-2 and K. pneumoniae-4 dataset, which were 262 from two different centers, indicating either loss of this plasmid in the other isolates of this strain, or 263 contamination during DNA isolation or library preparation ( Figure 2A). 264

Inter-laboratory variation in cgMLST profiles 265
To assess the baseline genetic variation of identical strains when these strains were sequenced in 266 different sequencing institutes, we compared the cgMLST and wgMLST profiles among the strains 267 from the three participating institutes. Only for E. coli and K. pneumoniae cgMLST schemes were 268 available for use in BioNumerics and, on average, 2441 (97.1%) and 615 (97.1%%) core genome 269 alleles were called for E. coli and K. pneumoniae respectively (Supplemental Figure 1). In total, 27 and 270 15 pairwise allelic distances were calculated among the nine E. coli and five K. pneumoniae strains. In 271 25/27 (93%) and 12/15 (80%) comparisons, a perfect concordance of cgMLST profiles was observed. 272 In the discordant comparisons, only one allele was differently called between two datasets 273 (Supplemental figure 2). 274 For Citrobacter sp. a highly diverse number of wgMLST alleles called was observed for the four 285 different strains that were selected for this study. This ranged from 1933 to 4426 alleles that were 286 called, even though the genome size did not vary strongly (mean 4.88Mb, range 4.66Mb to 5.31Mb) 287 and much lower normalized allele counts (mean 0.61, range 0.41 to 0.83) were observed than in 288 other species in this study (mean 0.84, range 0.43 to 0.98). Therefore, determination of the 289 variation on wgMLST for Citrobacter cannot be determined in this study as an incomplete set of 290 alleles were called. 291

Reference free wgSNP 292
As mutations in the genome can also arise in intergenic regions (which are not taken into account in 293 MLST based methods), all assemblies of each isolate were screened using pairwise SNPs. Pairwise 294 SNPs were determined using split k-mers with a variable middle base-pair, as implemented by SKA 13 . 295 First, the most optimal assembler for this task was chosen. For this, we determined the inter-and 296 intra-assembler variation introduced on the number of pairwise SNP between two de novo 297 assemblies. The best assembler was chosen based on the one that introduced the least number of 298 pairwise SNPs in the datasets from the same strains with the intra-assembler comparison. To reduce 299 complexity, only the E. coli dataset was used. Secondly, the number of pairwise SNPs was 300 determined of the entire dataset employing the best suited assembler. Additionally, also the 301 assembly free method for determining SNPs was used as in implemented by SKA. 302 The mean intra-assembly variation ranged from, 0.2 SNPs (assembly free), 2.7 SNPs (SKESA), 26.6 303 SNPs (SPAdes), up to 77.8 SNPs (Megahit) ( Figure 4ABCD). The mean Inter-assembler variation 304 ranged from 3.9 (assembly free compared to SPAdes) up to 43.0 SNPs ("SPAdes to megahit"). All 305 combinations, except the "assembly-free to assembly-free" and "SKESA to SKESA", revealed pairwise 306 comparisons with over 20 SNPs for the E. coli dataset. Therefore, only these two methods were used 307 to analyze the complete dataset. 308 Using the assembly free approach, 63/90 (70%) and 21/90 (21%) comparisons show zero or one 309 pairwise SNPs respectively ( Figure 5A). Only for K. pneumoniae, E. faecium, K. oxytoca, and K. 310 aerogenes more than 1 pairwise SNP was observed, with a maximum of 5 SNPs for K. oxytoca. Using 311 the assembly-based approach, in 10/90 (10%) comparisons, zero SNPs were observed among 312 assemblies ( Figure 5B). In 72/90 (80%) of the comparisons less than 5 pairwise SNPs were observed. 313 Interestingly, in the E. aerogenes and K. oxytoca datasets, more than 8 pairwise SNPs were observed. 314 However, on wgMLST no more than 4 alleles difference was observed. On average 3.4 (± 2.6) 315 pairwise SNPs were observed between assemblies of the same isolates (but sequenced in different 316 institutes). Overall, more pairwise SNPs were observed when assemblies were used for SNP analysis 317 compared to screening raw reads for SNPs. Comparing the number of k-mers between the assembly 318 free and assembly-based methods ranged from -2.1% to 1.2% (Supplemental figure 3), indicating that 319 a similar amount of k-mers between the two methods were compared.

Genotyping AMR genes and ORIs 346
Next, Identification of AMR genes and plasmid ORIs was performed. Overall, an excellent 347 reproducibility was achieved, as a precision of 99% was obtained. Most discrepancies could be 348 explained by the variation in the variant calling of a specific resistance gene. Four times in 973 349 genotype calls, there was an unexplained absence of a resistance gene. Although the DNA isolation 350 method used here showed good results for the application of WGS 36 , still some loci could be missed 351 due to inefficient isolation of plasmid DNA where these AMR genes can be located. Only twice and in 352 different institutes, an unexpected ColpVC ORI was found in one of the sequencing datasets, which 353 may indicate contamination during DNA isolation or library preparation. Strauß and co-authors 354 reported a 1.7% discordance caused by WGS compared to micro-array for the detection of resistance 355 and virulence genes 37 . In this study, a similar reproducibility in typing resistance genes and ORIs was 356 obtained and previously described by Kozyreva et al., which found a reproducibility rate of 99.97% 38 . 357

Genetic variation 358
It is of great importance that the genetic distance between technical duplicates does not surpass 359 commonly used thresholds to classify strains into clusters. Here, some variation among the wgMLST 360 allelic profiles was observed, which translated to an average of 0.49 discrepant alleles per 1000 361 alleles compared. Kluytmans-Van den Bergh et al. 21 reported a variation in genetic distance based on 362 wgMLST in a range of 0 to 0.001 (which translates to 5 alleles difference, based on 5000 alleles 363 compared) for five E. coli and three K. pneumoniae which were sequenced in duplicate 21 . This is in 364 concordance with our study, where 88/90 comparisons have no more than 5 alleles difference. 365 Additionally, reported clonal thresholds by these authors were roughly 26 and 2 alleles difference for 366 E. coli and K. pneumoniae on cgMLST respectively. For wgMLST this was 29, 23, 8, 14 alleles 367 difference for E. coli, K. pneumoniae, Citrobacter sp. and Enterobacter cloacae complex respectively. 368 These clonal thresholds are higher by a safe margin than the variation between any of the replicates, 369 sequenced of datasets. Although variation on a genetic level was observed, the level of disparity 370 remained below other thresholds commonly employed for hospital outbreak surveillance purposes 9 . 371 Previous work suggested a cut-off of 10 alleles for MDR E. coli and K. pneumoniae based on 372 cgMLST 39,40 . Therefore, it is safe to assume that if harmonized protocols are used, the genetic 373 difference remains within these previously described thresholds. 374 In the wgSNP analysis, all methods except for the "assembly-free to assembly-free" and "SKESA to 375 SKESA" showed pairwise comparisons with more than 20 SNPs. This indicates that using SPAdes or 376 Megahit in combination with a SNP based method, is unsuitable for outbreak surveillance, as 377 identical strains have more SNPs than commonly used outbreak thresholds 9 , indicating that these 378 strains would be considered not clonally related, thus not belonging to the same outbreak. 379 Furthermore, this also held true when comparing two assembly methods, which implies that 380 comparing bacterial assemblies should be avoided at all costs, if multiple centers employ different 381 methodologies to generate de novo assemblies for WGS outbreak surveillance. Potential outbreak 382 can be missed due to the large number of SNPs detected results in identical isolates not being 383 flagged as clonally related, which subsequently have implications for infection prevention control. 384 For the assembly free method, we observed most replicates to have no SNPs between each other 385 (70%), which is in line with the GenomeTrakr proficiency-test study, which found a similar fraction of 386 datasets showed having no SNPs (73%) 41 . 387 Variation in SNPs among strains showed lower number of SNPs based on the assembly-free method 388 compared to the assembly-based method. It is unlikely that this is caused by different number of k-389 mers that were compared for SNPs, as there was only a modest difference for the number of k-mers 390 compared between the assembly free and the assembly based SNP analysis, which ranged from -391 2.1% to 1.2% difference in compared k-mers (Supplemental figure 3). Therefore, it is more likely that 392 de novo assembly introduces phylogenetic noise in regions difficult to assemble, i.e. around regions 393 such as mobile elements (i.e. transposons and plasmids). Previously described work employing SNP-394 based methodologies to infer phylogeny among bacterial isolates often mask regions in the genome 395 that are sensitive for non-informative SNPs for phylogenetic reconstruction, such as mobile genetic 396 elements (MGE). Masking of these regions requires specialized tools such as Gubbins 42 that are able 397 to recognize regions with elevated numbers of base substitutions in the genome. Unfortunately, 398 using this reference-free methodology, makes this masking impossible to perform in an unbiased and 399 automated fashion, like in the Gubbins pipeline. Therefore, we must assume the possibility of 400 overestimation of SNPs among strains in our study.