High-throughput sequencing of SARS-CoV-2 in wastewater provides insights into circulating variants

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged from a zoonotic spill-over event and has led to a global pandemic. The public health response has been predominantly informed by surveillance of symptomatic individuals and contact tracing, with quarantine, and other preventive measures have then been applied to mitigate further spread. Non-traditional methods of surveillance such as genomic epidemiology and wastewater-based epidemiology (WBE) have also been leveraged during this pandemic. Genomic epidemiology uses high-throughput sequencing of SARS-CoV-2 genomes to inform local and international transmission events, as well as the diversity of circulating variants. WBE uses wastewater to analyse community spread, as it is known that SARS-CoV-2 is shed through bodily excretions. Since both symptomatic and asymptomatic individuals contribute to wastewater inputs, we hypothesized that the resultant pooled sample of population-wide excreta can provide a more comprehensive picture of SARS-CoV-2 genomic diversity circulating in a community than clinical testing and sequencing alone. In this study, we analysed 91 wastewater samples from 11 states in the USA, where the majority of samples represent Maricopa County, Arizona (USA). With the objective of assessing the viral diversity at a population scale, we undertook a single-nucleotide variant (SNV) analysis on data from 52 samples with >90% SARS-CoV-2 genome coverage of sequence reads, and compared these SNVs with those detected in genomes sequenced from clinical patients. We identified 7973 SNVs, of which 5680 were “novel” SNVs that had not yet been identified in the global clinical-derived data as of 17th June 2020 (the day after our last wastewater sampling date). However, between 17th of June 2020 and 20th November 2020, almost half of the SNVs have since been detected in clinical-derived data. Using the combination of SNVs present in each sample, we identified the more probable lineages present in that sample and compared them to lineages observed in North America prior to our sampling dates. The wastewater-derived SARS-CoV-2 sequence data indicates there were more lineages circulating across the sampled communities than represented in the clinical-derived data. Principal coordinate analyses identified patterns in population structure based on genetic variation within the sequenced samples, with clear trends associated with increased diversity likely due to a higher number of infected individuals relative to the sampling dates. We demonstrate that genetic correlation analysis combined with SNVs analysis using wastewater sampling can provide a comprehensive snapshot of the SARS-CoV-2 genetic population structure circulating within a community, which might not be observed if relying solely on clinical cases.

4 detecting SARS-CoV-2, most rely on the threshold cycle (Ct) values of RT-qPCR assays. Beyond 136 this, two recent studies have sequenced the SARS-CoV-2 genomes recovered from wastewater 137 (Crits-Christoph et al., 2021;Izquierdo Lara et al., 2020). 138 139 Despite the promising success of these prior studies, it is still unclear how well wastewater-based 140 epidemiology can identify the genetic diversity of SARS-CoV-2 in a given population and how this 141 relates to known viral diversity of clinical cases. This is especially important as new lineages are 142 being discovered. For example, the B.1.351 strain in the United Kingdom that contains single-143 nucleotide variants (SNVs) of potential biological significance such as N501Y (in the spike protein) 144 (Rambaut et al., 2020b) and K417N, E484K and N501Y in South Africa (Tegally et al., 2020). To 145 investigate the potential of using wastewater to gain insights into variants of SARS-  circulating in the population, we used a tiling amplicon-based high-throughput sequencing 147 approach to determine SARS-CoV-2 sequences (spanning the genome) in 91 wastewater 148 samples collected from 11 states in the United States (USA) between 7 th April 2020 and 16 th June 149 2020. To further survey the viral diversity circulating within a community and to examine how 150 these relate to sequences from clinical cases, we undertook SNV analysis and beta diversity 151 analyses of SARS-CoV-2 sequences in 52 (>90% coverage) out of the 91 wastewater samples 152 from 10 states. We focused specifically on spatial and temporal trends, and how they compare 153 with clinically-derived data. 154 155 2. Material and methods 156 157 2.1. Sample collection and transport 158 159 Flow-or time-weighted, 24-hr composite samples of untreated wastewater were collected either 160 from the headworks of the wastewater treatment plant, from within the wastewater collection 161 system or at hospital facilities using high frequency automated samplers (Teledyne ISCO, USA) 162 from locations across 11 states in the USA between 7 th April 2020 and 16 th June 2020 (Table 1, 163 Figure 1A, Sup Figure 1). Most samplers had refrigeration capabilities or were supplied with an 164 ice/dry ice blend to keep the interior collection vessel cool. During sample collection, wastewater 165 was thoroughly mixed and transferred to high-density polyethylene sample bottles and placed on 166 ice for transport. The samples were either hand delivered or shipped (next-day/2-day) in insulated 167 shipping containers for subsequent processing and analysis. 168 169 2.2. Wastewater sample processing and RNA extraction 170 171 Aliquots of 150 ml of each composite wastewater sample were filtered through a 0.45 μm 172 polyethersulfone (PES) filter and then subsequently through a 0.2 μm (PES) filter. The filtrate was 173 then concentrated using the Amicon ® Ultra 15 Centrifugal Filter Units (MilliporeSigma, USA) by 174 centrifuging at 4500 rpm for 15 min. For each sample, the process was repeated five times in total 175 using two filter units, and subsequently the concentrates were pooled per sample (from the two 176 filter units). For each sample, a 200 μl aliquot was used to extract total RNA using the RNeasy 177 mini kit (Qiagen, USA). 178 179 6 SARS-CoV-2 genomes were considered novel, however, to be more stringent, variants that were 223 only present in one of the wastewater samples were removed from further analyses. 224 225 2.5. Support for lineages assigned by PANGOLIN 226 227 Each environmental sample was compared against the SARS-CoV-2 genomes available in 228 GISAID (Elbe and Buckland-Merrett, 2017;Shu and McCauley, 2017), an open-access genomic 229 database, to collect a set of clinical genomes whose mutations were supported by the SNVs 230 identified above. To reduce false positives, basal genomes, defined as those with 3 or fewer 231 mutations relative to the reference (MN908947) were excluded. The set of genomes supported 232 by each environmental sample SNV profile were grouped by lineages assigned by PANGOLIN 233 (Rambaut et al., 2020a) and lineages with fewer than 3 genomes were excluded to avoid any 234 misannotations resulting in false positives. PANGOLIN is an online platform that assigns lineages 235 to sequences (Rambaut et al., 2020a)  The 'genotype' of each sample was represented in a four-column matrix. In this matrix, each row 249 corresponds to a position in the reference genome, and the value at each column is the frequency 250 of occurrences for each nucleotide (A, C, G and T). At each genomic position, the Yue & Clayton 251 measure of dissimilarity index (Yue and Clayton, 2005) on the nucleotide frequency of the 252 compared samples was calculated. If the nucleotide frequency at a position of a sample cannot 253 be calculated due to zero depth, the Yue & Clayton measure of dissimilarity index at this position 254 between this sample and any other sample compared is assumed to be zero. The sum of the Yue 255 & Clayton dissimilarity (Yue and Clayton, 2005) of all genomic positions was used as a measure 256 of distance between samples. The distance matrix was constructed by calculating pairwise 257 distances of all samples and was subsequently used for principal coordinates analysis (PCoA) 258 (Gower, 1966). 259 260 3. Results and discussion 261 262 3.1. Sample collection, processing and SARS-CoV-2 RT-qPCR screening 263 Sixty of our 91 samples (66%) were collected in Arizona (9 locations located in Maricopa County,264 Arizona Sup Figure 1), 12 (13%) were collected from 9 locations in Louisville, Kentucky (Sup 265 Figure 1), and 19 (21%) were collected from other states, see Table 1 and Figure 1A for details. 266 for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted January 25, 2021.

7
A sample collected in October 2019 in Tempe, Arizona was processed as a negative control . The  267  samples were processed using a virus concentration approach, followed by RNA extraction and  268  screening for the SARS-CoV-2 by RT-qPCR targeting the E gene. A standard curve with SARS-269 CoV-2 synthetic RNA (Twist Bioscience, USA) was used to estimate viral load and to establish 270 the limit of detection. Based on the standard curve we determined a consistent limit detection with 271 a Ct-value of ~34.0. For the samples we analysed, the Ct-values ranged from 26.8 to 36 (Table  272 1, Figure 1B For the 52 samples with >90% genome coverage, SNV analysis was undertaken using the 295 program iVAR (minimum quality of 20 and >20 reads per position) without a frequency threshold 296 in order to detect all variations at a population level. This approach was used because, unlike the 297 case with a clinical sample from a single infected individual, wastewater contains material from a 298 population that inhabits a particular region and therefore represents a collection of SARS-CoV-2 299 variants actively shed by infected individuals within the population.  Figure  306 2A). As expected, mean depth is correlated with the number of SNVs detected in each sample 307 ( Figure 2B), the regression analysis indicates the trend. 308 309 for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted January 25, 2021. ;

8
To determine unique variants within the 52 wastewater-derived SARS-CoV-2 sequences, SNVs 310 counted in more than one sample at each site were removed and this resulted in 5680 unique 311 SNVs identified across the genome. Of these, 4372 are non-synonymous and 1308 are 312 synonymous substitutions. Additionally, 246 are nonsense mutations and 64 are in non-coding 313 regions. We highlight that SNV A23403G responsible for the spike protein substitution D614G 314 that is frequently seen in clinical data, although it has not thus far been shown to be under strong 315 positive selection (Volz et al., 2021), was present in all 52 wastewater-derived SARS-CoV-2 316 sequences. From one sample (sample #147, Tempe, Arizona), a new variant A23403T was also 317 identified that results in a D614V substitution in the spike protein, but at very low frequency (Sup. 318 Table 1). 319 320 3.4. Comparative analysis of SARS-CoV-2 SNVs in clinical and wastewater-derived 321 samples during the collection period 322 323 The wastewater-derived SARS-CoV-2 SNVs were compared with substitutions that have been 324 detected in clinical-derived sequences. The first aim was to identify possible "novel" SNVs present 325 in the analysed wastewater samples that had not yet been identified in any of the sequences 326 available in GISAID (Elbe and Buckland-Merrett, 2017;Shu and McCauley, 2017) from clinical 327 samples globally. To accomplish this, we initially undertook an analysis to identify all the detected 328 SNVs in the clinical data available from GISAID up until the 17 th June 2020 (subsequent to the 329 last day of wastewater sampling in this study -16 th June 2020) which on that date consisted of 330 45,836 SARS-Cov-2 genome sequences. A total of 548 novel SNVs were identified in the 52 331 wastewater samples collectively, of these 469 were non-synonymous (not including nonsense 332 mutations) and 79 were synonymous substitutions ( Figure 3). Since we evaluated all variants 333 regardless of frequency, some locations (as expected) had more than one possible variant and 334 are illustrated in Figure 3 and outlined in Sup Table 1 To determine how many SNVs have been identified post wastewater sample collection (16 th June 344 2020), a second SNV comparison was performed with all the clinical-derived sequence data 345 available as of 20 th November 2020 (203,741 SARS-Cov-2 genomes available at GISAID). Based 346 on the analysis of samples during the collection period, SNVs that were not detected in the clinical-347 derived sequence data were considered as novel SNVs. From the 548 SNVs considered as 348 "novel" from the wastewater-derived samples, 263 SNVs have subsequently been identified in 349 clinical-derived samples in the period of 17 th June -20 th November 2020 (Sup Table 1, Figure 3). 350 285 SNVs identified in the wastewater-derived samples with the last sampling date of 16 th June 351 2020 have not been identified in clinical-derived SARS-CoV-2 sequences between then and 20 th 352 November 2020. 353 for use under a CC0 license.
This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted January 25, 2021. ;

354
It is important to highlight that the detection of these "novel" SNVs does not necessarily indicate 355 they are fixed in SARS-CoV-2 lineages that are actively being transmitted nor is it possible to 356 determine if any of these SNVs are linked within lineages. Nonetheless, the identification of the 357 "novel" SNVs clearly demonstrates the relevance of wastewater-derived SARS-CoV-2 sequence 358 analysis which can provide valuable information on SNVs that are not captured using clinical-359 derived approaches. The wastewater-derived sequence analysis does provide information at a 360 population scale and can allow for rapid detection of clinically relevant / important SNVs. 361 362 3.6. Determination of putative lineages of SARS-CoV-2 in wastewater-derived sequences 363 364 Given that wastewater harbours a collective population of SARS-CoV-2 and therefore likely many 365 variants, it is not ideal to determine consensus sequences and consensus sequences-based 366 phylogeny. Therefore, our first approach was to evaluate which clades in the global phylogeny of 367 clinical-derived sequences are supported by the SNVs present in each sample based on the 368 SARS-CoV-2 lineages assigned by PANGOLIN (Rambaut et al., 2020a). The represented SARS-369 CoV-2 lineages for each wastewater sample that are supported are shown in Figure 4. We 370 determined the time frames for which these lineages were first detected in North American 371 clinical-derived sequences relative to the date each wastewater sample was collected ( Figure  372 4A). 373 374 We also undertook a comprehensive analysis of all the lineages detected in each state in the USA 375 up to November 2020 that were supported by at least one environmental sample, this included 376 the number of clinical-derived SARS-CoV-2 genomes sequenced in each lineage ( Figure 4B). 377 This approach helps to determine whether wastewater-based surveillance for SARS-CoV-2 can 378 provide valuable insights on putative circulating lineages in the wastewater contributing 379 population. Although there are several limitations to the analysis of wastewater-derived SARS-380 CoV-2 sequences, our analysis of SNV-based supported lineages revealed some interesting 381 findings. From the 52 analysed wastewater samples, 15 SARS-CoV-2 lineages assigned by 382 PANGOLIN (Rambaut et al., 2020a) were supported, with lineage B.1.5 being the most prominent 383 for the wastewater-derived sequences. wastewater collection, as of 17 th June 2020 ( Figure 4). These 17 samples were from the states of 389 Arizona, Kentucky and Massachusetts ( Figure 4B). In wastewater-derived sequences from 390 Arizona, which represents the greatest proportion of samples, the observed circulating lineages 391 based on clinical-derived sequences are well represented (Ladner et al., 2020), with an additional 392 nine possible circulating lineages identified. 393 394 Although wastewater-based SARS-CoV-2 sequence analysis does not provide the same level of 395 genome confidence (and thus lineage assignment) as those from clinical samples, the 396 wastewater-derived data can be used to identify possible circulating lineages and assess the 397 for use under a CC0 license.
This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. In Figure 5, we show our PCoA analysis results using nucleotide frequencies to evaluate the viral 410 population diversity within and between samples. SARS-CoV-2 sequences in the samples from 411 the ten states were overall highly diverse, and those with two or more samples from the same 412 state tend to cluster closer together ( Figure 5). The main exceptions are those from Kansas (20 th 413 May 2020 and 27 th May 2020) and Colorado (20 th May 2020 and 28 th May 2020) that do not cluster 414 together, both were collected a week apart, and the locations have an estimated human 415 population size of ~25,900 and ~8,300, respectively. Additionally, the Arizona wastewater SARS-416 CoV-2 sequences are broadly distributed in the PCoA plot which is likely a consequence of the 417 large number of samples collected over a three-month period across several sites within Maricopa 418 County, Arizona (Tempe sites, Guadalupe and Gilbert) ( Figure 5A, B and C). In comparison to 419 those in the Arizona wastewater samples, the SARS-CoV-2 sequences in samples from Louisville 420 (Kentucky) are much more tightly clustered in the PCoA plot despite sampling from several 421 locations in the city over a two-month period ( Figure 5A). Despite the large number of samples 422 collected in Arizona compared to Kentucky, and the other states, if seven individual samples were 423 to be randomly picked from each location over the same period as those from Kentucky the SARS-424 CoV-2 genetic distance between them would still be apparently higher for Arizona. We 425 hypothesize that one contributing factor to the differences in viral diversity present in these two 426 areas i.e. Maricopa County Arizona and Louisville (Kentucky), is that, Tempe (the region where 427 the majority of the samples were collected) is home to one of the largest universities in the USA, 428 Maricopa County is the 4 th most populous county in the USA with ~4.4 million inhabitants 429 (Maricopa County 2020) and a major travel hub with an international airport. 430 431 The highest number of samples collected within a state both temporally and spatially for this study 432 was in Arizona. In Arizona, we note that the wastewater-derived SARS-CoV-2 sequences in 433 samples from the same locations do not necessarily cluster together in the PCoA plot ( Figure 5C). 434 Nonetheless, there are clear shifts in the SARS-CoV-2 sequence variants in each location over 435 time ( Figure 5B and C). This is most evident for the Town of Guadalupe ( This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted January 25, 2021.

11
CoV-2 sequences in the samples from the same location at closer timepoints are often more likely 442 to be similar, yet there are exceptions such as the samples from site TP04 (Tempe, Arizona) that 443 have no resident population ( Figure 5B and C Wastewater-based analysis is rapidly becoming a useful platform for investigating the 458 epidemiology of viruses shed in human excretions (Farkas et al., 2018;Farkas et al., 2020;459 Tambini et al., 1993). In this study, we analyse HTS data of wastewater-derived SARS-CoV-2 460 sequences to determine SNVs, putative circulating lineages and also population structure at a 461 spatial and temporal scale. Analysis of wastewater-derived SARS-CoV-2 sequences from 10 462 states ( Arizona and Kentucky where we had undertaken temporal and spatial sampling, some 477 PANGOLIN that had been detected in SARS-CoV-2 clinical-sequence data were also supported 478 in the wastewater in addition to several other putative lineages which may have been missed by 479 clinical sampling (Figure 4). In conjunction with diversity analyses using distance matrices ( Figure  480 5) this shows trends in viral populations which can help to track the spread of the SARS-CoV-2. 481 482 This study supports the use of wastewater sampling as a tool suitable for analysing the genomics 483 of ongoing outbreaks of infectious diseases, such as SARS-CoV-2. As demonstrated here, HTS 484 of RNA from wastewater can provide novel information on SNVs and lineages which, when 485 for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.   This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted January 25, 2021. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
for use under a CC0 license.
This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

B.
A.
G u a d a lu p e T P 0 1 T P 0 2 T P 0 3 T P 0 4 T P 0 5 G 2 R u r a l T P 0 6 Figure 4: Publicly available genomes from clinically derived data deposited in GISAID, grouped by PANGO-LIN, whose mutations were consistent with those observed in wastewater samples. A. Heatmap showing the number of days between sample collection and when supported lineages were first observed in clinical data. Each wastewater sample (52 samples across 10 states) contained support for different clinical samples which are grouped here by PANGOLIN, some of which have only been observed outside North America (indicated as "global only"). B. Clinical genomes reported in USA states and territories which were assigned to PANGOLIN supported by at least one environmental sample. Black borders indicate lineages supported in environmental samples from the respective location.
for use under a CC0 license.
This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted January 25, 2021. ; https://doi.org/10.1101/2021.01.22.21250320 doi: medRxiv preprint   Figure 5: Principal coordinate analysis (PCA) of SARS-CoV-2 sequence data derived from wastewater samples. A. Distribution of sequences from samples collected in ten states (each represented by a different colour) in the USA showing pairwise distance based on genomic composition between viral populations present in each sample. B. Timeline representation (shown by the colour gradient) of samples taken from the sample locations across ten USA states between April-June 2020 with pairwise distance based on genomic composition between viral populations present in each sample. C. Spatial representation of SARS-CoV-2 sequences from samples collected from various regions within Arizona (represented by different symbols) comparative to those from other states. D. Sampling catchments in Tempe, Guadalupe and Gilbert, Arizona.
for use under a CC0 license.
This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.