Genome Sequencing of Sewage Detects Regionally Prevalent SARS-CoV-2 Variants

Viral genome sequencing has guided our understanding of the spread and extent of genetic diversity of SARS-CoV-2 during the COVID-19 pandemic. SARS-CoV-2 viral genomes are usually sequenced from nasopharyngeal swabs of individual patients to track viral spread.

Genomic epidemiology, the analysis of viral and microbial genomes in order to make inferences about pathogen evolution, transmission, and spread, has played an important role in improving our understanding of the transmission dynamics of the SARS-CoV-2 pandemic (4). Early in the pandemic, this approach revealed multiple introduction events into California and viral lineages present at different abundances across counties in Northern California (5). Genome sequencing was also used to show that there was unexpectedly frequent community spread of a specific genotype after early introduction in Washington State (6). Genome sequencing in the New York City area identified multiple viral introduction events from Europe (7), and sequencing in the Mission district of San Francisco identified distinct viral strains in a single neighborhood, with transmission between family clusters (8).
Unlike many respiratory viruses, RNA of SARS-CoV-2 and other coronaviruses can be detected in human feces (9)(10)(11). Before the COVID-19 pandemic, members of the Coronaviridae had been previously identified in municipal wastewater through both RT-qPCR and shotgun metagenomic and metatranscriptomic sequencing (12,13). Since the start of the COVID-19 pandemic, wastewater RT-qPCR has quantified the amount of SARS-CoV-2 RNA in sewage to estimate the abundance of the virus across many different municipal regions globally (14)(15)(16)(17)(18)(19)(20)(21)(22). Prior work showed that shotgun wastewater sequencing can provide information about many viruses simultaneously (12,23,24) and enable genome-resolved (25) and phylogenetic analyses (26,27). In one study, a SARS-CoV-2 consensus genome was obtained from sewage via targeted amplification and long-read sequencing, allowing for phylogenetic analysis of the predominant lineage (27). Here, we show that sequencing of viral concentrates and RNA extracted directly from wastewater can identify multiple SARS-CoV-2 genotypes at various abundances known to be present in communities, as well as additional genotypic variants not yet observed in local clinical sequencing efforts.

RESULTS AND DISCUSSION
Metatranscriptomic detection of SARS-CoV-2 and other viruses in wastewater. Twenty-four-hour 1-liter composite samples of raw sewage were collected from wastewater treatment facilities in Alameda and Marin Counties in Northern California between 19 May 2020 and 15 July 2020 (see Table S1 in the supplemental material). We extracted nucleic acids from samples using three methods that enriched for viral particles (ultrafiltration) or total RNA (RNA silica columns or silica milk). SARS-CoV-2 viral RNA was first detected using a RT-qPCR assay (see Materials and Methods) of the N gene and C q values ranged from 29.5 to 36.2, or an estimated ;2 to ;553 genome copies/ml of RNA. From this we estimate that there were 2.8 Â 10 5 genome copies/liter of wastewater on average across our samples (see Table S1). For each sample, 40 to 50 ml of RNA was prepared for sequencing, implying an estimated ;4,438 viral genome copies on average were contained within each sequencing library.
After cDNA synthesis from the total RNA, samples were enriched for a panel of human respiratory viruses using a commercially available oligo-capture approach (Illumina respiratory virus panel; see Materials and Methods) and sequenced on a NextSeq 550 to produce on average 12 million 2 Â 75 bp reads per sample. Reads were mapped to the human genome to estimate the amount of human RNA/DNA in the samples (0.7 to 16% of reads per sample). Sequencing reads were then mapped to a dereplicated set of all eukaryotic viruses contained in the RefSeq database, and stringently filtered to include only high-quality reads matching reference sequences with .97% identity (see Materials and Methods). Viral abundances and SNVs (single nucleotide variants) were then calculated using the metagenomic strain-typing program inStrain v1.12. We detected SARS-CoV-2 at various abundances of sequenced RNA/ DNA (0 to 14%) across samples ( Fig. 1a and b; see also Table S1). Sequencing relative abundance of SARS-CoV-2 was not strongly correlated with RT-qPCR genome copy quantification, likely due to the variability introduced by different extraction methods. Viral enrichment by ultrafiltration achieved higher relative abundances of SARS-CoV-2 RNA, although these experiments were time-intensive and often had lower absolute genome copy number recovery according to RT-qPCR. In addition, we sequenced replicates from one set of samples with rRNA depletion but no viral enrichment. Without enrichment, we were able to only detect fewer than 40 total SARS-CoV-2 read pairs ( Fig. 1c; see also Table S1). While this illustrates the difficulty of detecting specific viruses in wastewater in unenriched sequencing data sets, larger sequencing efforts may overcome this limitation by sequencing more deeply.
Other human viruses identified in the wastewater sequencing included Human bocaviruses 2c and 3 ( Fig. 1a and b), both of which are respiratory viruses sometimes capable of causing gastroenteritis, and are included in the Illumina respiratory virus panel. Bocaviruses have been identified in sewage samples previously (28,29). Picornavirus-like viruses were also detected (Fig. 1c). The most abundant viruses in the data were plant viruses including cucumber green mottle mosaic virus and pepper mild mottle virus (PMMoV) (Fig. 1a). These viruses are known to be highly abundant in human wastewater (30) and have been used as fecal loading controls in wastewater SARS-CoV-2 quantification (19). Near-complete (.95% breadth of coverage) genomes were obtained for SARS-CoV-2, bocavirus 3, PMMoV, and other plant viruses (see Table S2), implying that these viruses were at high enough abundance in the data set for exact genomic analysis.
Recovery of complete and nearly complete SARS-CoV-2 viral genomes from wastewater. Complete consensus viral genomes are required to perform viral lineage tracking for genomic epidemiology. We obtained complete consensus SARS-CoV-2 genomes (breadth of coverage .99%) from 7 of 22 samples (31%), while large-scale patient sequencing efforts have for example obtained genomes for ;80% of samples (31). Only samples with RT-qPCR C T values ,33 (;25 genome copies/ml) yielded complete consensus genomes (Fig. 1d), but we also recovered at least one genome using each of our three extraction methods. The mean depth of coverage for each complete genome ranged from 7Â to 107Â after filtering and removal of PCR duplicates. The consensus genomes from Alameda County, and the one from Marin County, were all within 4-bp differences of each other. These consensus genomes were found to be unlikely to be chimeric, as a BLAST analysis identified SARS-CoV-2 genomes that were 100% identical at all nongapped positions (see Table S3) obtained from patients in northern California. Consensus genomes may represent predominant SARS-CoV-2 lineages in the population in the serviced areas during the summer of 2020. The results demonstrate genomic accuracy for recovery of consensus SARS-CoV-2 genomes so long as sufficient coverage is achieved in metatranscriptomic data sets.
Identification of alternative SARS-CoV-2 variants in wastewater populations recovers locally reported clinical genotypes. While consensus genotypes can describe the predominant genotype of a virus in a metatranscriptome, the strength of wastewater-based sampling and sequencing lies in the ability to identify alternative genotypes in the population being sampled. Using a recently developed pipeline for metagenomic SNV calling (32), we identified putative SNVs that are variable within the viral population sampled in each wastewater sample after read mapping to the SARS-CoV-2 reference genome EPI_ISL_402124 ( Fig. 2a; see also Table S4). Due to the large-scale sequencing efforts of SARS-CoV-2 in patients in both northern California and worldwide, we established that these SNVs had also been detected in genomes from individual patients. Across all samples, 50% of SNVs observed in wastewater samples at .10% frequency were also observed in patient-derived viral genomes from California; 61% were observed in viral genomes from the United States, and 71% were observed in any viral genomes collected worldwide. SNVs that have been observed in California patients had significantly higher allele frequencies in the wastewater samples than those that were not detected in clinical cases (mean, 48 versus 15%, respectively; P , 0.01 [two-sided t test]) (Fig. 2b). This is likely because the more abundant a SNV is in the population, the more likely it is to be sampled in wastewater and in the clinic. Further, several of the same SNVs were observed across samples, and these recurrent SNVs were, on average, 2.3Â more likely to be observed in California or U.S. patientderived genomes than SNVs observed once (Fig. 2c). Taken together, these are strong signals that deeper sequencing of wastewater and combining information across samples better recapitulates true viral genomic variation in the sampled population.
More than 75,000 patient-derived SARS-CoV-2 genomes have been sequenced and deposited into the GISAID database globally, including 2,500 genomes obtained from patients in California. To understand the context of the viral genomic variation we observed within wastewater samples, we used a hypergeometric test to calculate the likelihood of overlap by chance between the set of wastewater variants and the set of variants observed in viruses from patients in a given region. This computes the probability of observing a certain amount of overlap in variants by chance and accounts for the fact that some regions have far more sequenced patient genomes and correspondingly more alleles than others. For example, the probability of the observed overlap between wastewater variants and California clinical variants having occurred by chance was calculated to be P , 10 210 , indicating a high likelihood of nonrandom overlap. By further comparing the probabilities of SNV overlap between patient genotypes and wastewater genotypes at the NextStrain "location" level (corresponding to counties and/or cities), we found the highest likelihood of nonrandom overlap between all wastewater genotypes observed and clinical genotypes from Alameda County (Fig. 2d), the location that the wastewater samples were also derived from.
Identification of potential lineage transmission events previously undetected in local patient-based sequencing at time of sampling. Some clinical SARS-CoV-2 viral strains can be differentiated by more than one SNV. Across the wastewater data set, we observed one pair and one triplet of SNVs that were shared by clinical isolates. The pair and triplet of SNVs each occurred at similar frequencies, supporting their linkage in wastewater genomes ( Fig. 3a and b). In addition to the SNVs that also have been observed clinically in California, there were four SNVs recurrent across wastewater samples that had not been previously observed in California but had been observed elsewhere in the United States (Fig. 3c). Two adjacent SNVs (14222G and 14223C) are associated with a single viral strain that has been often observed in clinical samples in Washington State. Another two SNVs (8083A and 1738T) are not linked, but both have been observed in different clinical genomes of four other states in the United States. Interestingly, these variants appear to have arisen or arrived in the United States only during the month of July, suggesting that they may be detected in clinical samples from California in the near future.
Overall, this study demonstrated that wastewater sequencing can accurately identify genotypes of viral strains that are clinically detected in a region and those not yet detected by clinical sequencing. Another key advantage of this method is that it does not rely on specific PCR primers, which can fail to detect SARS-CoV-2 strains with mutations in the primed sequence (33). With more intensive wastewater sampling, this approach also has the potential to reveal patterns of virus distribution within communities, helping to elucidate the transmission and spread of diseases during epidemics. Perhaps most significantly, the results indicate that wastewater sequencing can detect recent introductions of SARS-CoV-2 genotypes and other disease-causing viruses at a population scale.

MATERIALS AND METHODS
Sample collection and extraction. Twenty-four-hour 1-liter composite samples were collected at four different wastewater interceptors in the San Francisco Bay Area (labeled "Berkeley," "Berkeley Hills," "Oakland," and "Marin," based roughly on the municipal areas each services). The time-weighted composite samples were collected using autosamplers that draw from influent every 15 min into 24-hourly bottles, which were then combined and mixed, and subsamples were taken for analysis. Samples were immediately processed by extraction via three different methods. The first method was ultrafiltration SARS-CoV-2 Genomes Sequenced from Sewage ® January/February 2021 Volume 12 Issue 1 e02703-20 mbio.asm.org 5 with Amicon Ultra-15 100-kDa centrifugal filter units. Wastewater was heat inactivated in a water bath at 60°C for 90 min. Wastewater samples were then filtered on 0.22-mm SteriFlip filter units. While we found that the 0.22-mm filtration step, which was implemented to reduce clogging of the Amicon ultrafilter, did result in a loss of RNA (data not shown), we believe the methods recovered a sufficient quantity of viral RNA to adequately profile their genetic diversity. Amicon filter units were prepared by incubation with 1% bovine serum albumin in 1Â phosphate-buffered saline (PBS) on ice for 1 h and then spun, loaded with 2 ml of PBS, and spun again to rinse. Amicon 100-kDa centrifugal filter units were then loaded with 15 ml of wastewater filtrate (flowthrough) and spun in a swinging-bucket rotor at 4,750 Â g for 30 min at 4°C. Flowthrough was discarded, and amicons were reloaded with sample until all sample volume (40 ml) had been processed. For three samples (see Table S1), we processed more than 40 ml per sample but found that this did not improve the resulting SARS-CoV-2 genome quality in this specific instance. For all Amicon centrifuge-concentrated samples, the final volume of the concentrate was ;250 ml. RNA was then extracted with a Qiagen AllPrep DNA/RNA minikit. The second extraction method, direct RNA extraction with silica columns, began with viral and bacterial lysis of samples with 9.5 g of NaCl per 40 ml of wastewater and filtration on a 5-mm polyvinylidene fluoride (polyvinylidene difluoride) filter. The resulting filtrate (flowthrough) was then loaded onto a Zymo III-P silica spin column via vacuum manifold, and RNA was directly eluted from this column. Details of this protocol are available elsewhere (https://www.protocols.io/view/v-2-direct-wastewater-rna-capture-and-purification -bjr9km96). The third extraction method, "milk of silica," began with sample lysis and filtration, as in the second method. Filtered lysate is bound to free silicon dioxide particulate, eluted from the particulate, and concentrated via isopropanol precipitation. This protocol is also available online (https://www .protocols.io/view/direct-wastewater-rna-extraction-via-the-34-milk-o-biwfkfbn).
RT-qPCR and genome copy quantification. The number of viral genome copies in each sample was determined via probe-based qRT-PCR on an Applied Biosystems QuantStudio 3 real-time PCR system with the Thermo Fisher TaqPath 1-Step RT-qPCR Master Mix or TaqMan Fast Virus 1-Step Master Mix. The primer set and probe were purchased as part of the 2019-CoV RUO kit (IDT), and our quantification used the previously published CDC N1 assay (34). Either 2 or 5 ml of sample was used for each reaction (see Table S1) in a 10-or 20-ml reaction, respectively. Cycling conditions were 25°C for 2 min, 50°C for 15 min, 95°C for 2 min, and 45 cycles of 95°C for 3 s and 55°C for 30 s. A standard curve for absolute quantification of viral genome copies was generated with synthetic RNA standards of the SARS-CoV-2 genome (Twist Biosciences).  Table S1), rRNA depletion was performed, and oligonucleotide capture enriched and unenriched sequencing strategies were compared. The rRNA depletion was done using RiboZero Plus supplemented with a comprehensive "Gut Microbiome" probe set. Libraries were prepared using the Illumina RNA Prep with Enrichment (L) Tagmentation protocol. The rRNA-depleted samples were amplified for 20 cycles. Enrichment was performed using the Illumina Respiratory Virus Oligo Panel.
Metatranscriptomic viral abundances. The abundances of viruses within wastewater were obtained by mapping reads with Bowtie 2 (35) to an index of all viral genomes downloaded from the RefSeq Database (release 201). For abundance calculations, mapped read pairs with MAPQ . 20 and pair percent identity to the reference .95% were retained using inStrain v1.3.2 (32). Duplicate reads were removed with the clumpify.sh dedup command from the BBTools software suite (Bushnell 2014). Only viral genomes with at least 10% breadth of genomic coverage obtained were reported.
SARS-CoV-2 variant analysis. Seven samples with nearly complete SARS-CoV-2 breadth of genomic coverage (.99%) were further investigated for a strain-resolved analysis. SNV calling was performed using inStrain v1.3.2 on all read pairs with .90% average nucleotide identity to the SARS-CoV-2 reference. An absolute minimum of two read pairs supporting a variant allele was required for any SNV to be considered in further analysis. PCR duplicates were removed with the markdup command in the Sambamba package (36). All analysis and SNV locations reported are with respect to the reference genome "hCoV-19/Wuhan/WIV04/2019jEPI_ISL_402124j2019-12-30jChina." Consensus genomes from each sample were created using a custom Python script that required a minimum of three reads supporting each genomic position. A multiple sequence alignment of publicly available SARS-CoV-2 genomes and their metadata were downloaded from the GISAID (3) EpiCov database on 23 August 2020. The multiple sequence alignment was processed with a custom Python script to obtain a list of variants for each genome with respect to the WIV04 reference sequence. We removed from all analyses the genomic positions recommended to be masked from SARS-CoV-2 alignments by https://virological.org/t/masking -strategies-for-sars-cov-2-alignments/480. Hypergeometric distributions were calculated with the stats. hypergeom function in scipy (37) to compare wastewater samples to all clinical data from each NextStrain "location" with at least 20 genomes deposited. The following parameters were used for hypergeometric distribution testing: the total number of SNVs observed across all clinical SARS-CoV-2 genomes, the number of SNVs observed in wastewater, the number of clinical SNVs in a region, and the observed overlap between the two. The reproducible code is available at https://github.com/alexcritschristoph/wastewater _sarscov2.
Data availability. Sequencing data for this project has been released under NCBI BioProject ID PRJNA661613. Processed data, reproducible code, and workflows for the analyses performed are available at https://github.com/alexcritschristoph/wastewater_sarscov2.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.

ACKNOWLEDGMENTS
We gratefully acknowledge the originating and submitting laboratories of SARS-CoV-2 genomes in the GISAID EpiCoV database (www.gisaid.org) that were used for our comparisons to clinical samples and in particular the Innovative Genomics Institute SARS-CoV-2 Sequencing Group for Alameda County genomes. We also gratefully acknowledge Vinson Fan for assistance with RT-qPCR and the laboratory of Robert Tjian for sharing materials.
Funding was provided to K.L.N. and J.F.B. by a Rapid Research Response grant from the Innovative Genomics Institute (IGI) and a seed grant from the Center for Information Technology Research in the Interest of Society (CITRIS) at UC Berkeley.