Genomic epidemiology reveals multiple introductions and spread of SARS-CoV-2 in the Indian state of Karnataka

Karnataka, a state in south India, reported its first case of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection on March 8, 2020, more than a month after the first case was reported in India. We used a combination of contact tracing and genomic epidemiology to trace the spread of SARS-CoV-2 in the state up until May 21, 2020 (1578 cases). We obtained 91 genomes of SARS-CoV-2 which clustered into seven lineages (Pangolin lineages—A, B, B.1, B.1.80, B.1.1, B.4, and B.6). The lineages in Karnataka were known to be circulating in China, Southeast Asia, Iran, Europe and other parts of India and are likely to have been imported into the state both by international and domestic travel. Our sequences grouped into 17 contact clusters and 24 cases with no known contacts. We found 14 of the 17 contact clusters had a single lineage of the virus, consistent with multiple introductions and most (12/17) were contained within a single district, reflecting local spread. In most of the 17 clusters, the index case (12/17) and spreaders (11/17) were symptomatic. Of the 91 sequences, 47 belonged to the B.6 lineage, including eleven of 24 cases with no known contact, indicating ongoing transmission of this lineage in the state. Genomic epidemiology of SARS-CoV-2 in Karnataka suggests multiple introductions of the virus followed by local transmission in parallel with ongoing viral evolution. This is the first study from India combining genomic data with epidemiological information emphasizing the need for an integrated approach to outbreak response.


Introduction
Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), a novel coronavirus that was first detected in individuals with acute pneumonia in China in late 2019, has now spread throughout the world 1

. The World Health Organization (WHO) on March 11 declared the disease Coronavirus
Disease 2019 (COVID19) caused by SARS-CoV-2 a pandemic 2 . COVID19 has claimed over 500,000 lives (as of July 5, 2020) and the pandemic is ongoing 3 .
Genomic epidemiology from the analysis of viral sequences from all over the world is consistent with the emergence of the virus in late 2019 in China and consequent spread and expansion in Europe, and other parts of the world 4,5 . More than 50,000 complete genomes of SAR-COV-2 from all over the world are currently available in public databases such as the GISAID initiative (originally known as Global Initiative on Sharing All Influenza Data) 6 . While this information is invaluable for understanding evolution of the virus 7 , pathogenesis, and design of diagnostic tools, few studies have been able to combine this with epidemiological data to derive insights on viral spread. These studies have provided information on how the virus is introduced and spread in a population.
A comprehensive study of circulating variants of the virus in Iceland, which included over 580 complete genomes in combination with epidemiological information (travel history and contact tracing) revealed that while the initial importation of the virus was from China and Southeast Asia subsequent importations were from different parts of Europe 8 . Studies based on complete SARS-CoV-2 genomes from Guangdong Province in China highlighted the initial importation of virus to the province by travel and limited local transmission 9 . A study of viral genomes from the east coast of the USA combined with travel data revealed the coast-to-coast spread of virus within the country 10 . Initial studies in Washington State uncovered cryptic local transmission 11 . These studies reiterate the importance of combining sequencing data with public health information.
The first case of COVID19 in India was detected on January 30, 2020 and case numbers have continued to rise inspite of stringent interventions including nationwide lockdowns. In the first few months of the outbreak, between January 22-April 30, 2020, test results from all over India could be averaged to a positivity rate of 3.9% 12 . Analysis of these cases revealed that the test positivity rate was highest when the samples were from contacts of a known COVID19 positive case 12 .
A large number of SARS-CoV-2 genomes (about 1500 complete genomes as of July 5, source-GISAID) have been sequenced in different parts of India. The first sequences from India were reported from individuals with travel history to China, Italy and Iran 13,14 .
An analysis of 361 complete genome sequences from India showed that five global clades were circulating in India -old Nexstrain clades B, B4, A2a, A3, and a distinct clade A3i 15 . The A2a (European) clade was found to be the most prevalent, followed by A3i 15,16 . While these studies have added valuable information on circulating lineages of SARS-CoV-2 in India, they have not comprehensively linked genomic data with epidemiological information. This study was therefore undertaken to dissect the molecular epidemiology of SARS-CoV-2 in Karnataka, a state in South India. Here we report 47 full-length SARS-CoV-2 genome sequences obtained from individuals who tested positive for the virus by RT-PCR and present an analysis of epidemiological information combined with genomic data to elucidate the introduction and spread of the virus in the state. Samples with Ct value > 30 were included when they were considered critical for representing a cluster or if they were from symptomatic individuals.

Methods
We used a tiling primer based approach for whole genome sequencing described by the ARTIC Network using Primal Scheme 9,18 . Briefly, we used the V3 primer set-these are 96 pairs with amplicons of about 400 basepairs (bp) spanning the whole genome except 31bp of the 5' and a part of the 3'UTR. PCR was performed by pooling adjacent/overlapping primers into different pools so as to prevent preferential amplification of short fragments between adjacent primer pairs. Primers were initially used at a concentration of 10μM as per the protocol, then modified to amplify regions that were missed by increasing primer concentrations to 50μM. For four regions additional primers were designed and a separate reaction was set up before the pooling step in order to complete the genome. The resulting PCR amplicons were used for preparing libraries for 3 All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted July 11, 2020.

Analysis of sequencing data
Sequences were basecalled and demultiplexed using guppy (v3.6), read lengths between 100-600bp were considered for further analysis. Primers were removed from the sequencing reads by trimming 25bp at the ends and additional trimming based on alignment using BBDuk (v1).
Resulting reads were mapped to the RefSeq strain (NC_045512) using Minimap2 (v1.17) within Geneious Prime (Geneious Prime 2020.0.3). A consensus was created for regions with coverage >10x using the 0% majority rule and corrected. Consensus was aligned to the reference to ensure the correct reading frame and annotation was transferred from the reference. Sequences were deposited into the GISAID database.
Phylogenetic analysis, lineage assignment, detection of SNPs and amino acid replacements Consensus sequences from the 47 genomes from this study were aligned with the reference genome using MUSCLE (v 3.8.425) 19 . The multiple sequence alignment was used to infer the phylogeny using iqtree (v 1.6.12) 20 . A total of 88 DNA models were tested, and the TPM2u+F+I model was selected based on the Bayesian Information Criterion. Maximum likelihood tree was constructed as the consensus tree from 1000 bootstraps, using the reference sequence (NC_045512) as the outgroup. Nodes with bootstrap values > 83% were used for interpretation.
Time scaled phylogenies were constructed using TreeTime with the multiple sequence alignment described above and the date of sampling as dates. Pangolin lineage assignments were performed using the online tool 21 . Single nucleotide polymorphisms (SNPs) and amino acid replacements were detected using the CoV-Glue web application 22 . Both tools use sequences submitted to the GISAID database 6 .

Analysis of epidemiological data and contact map
The epidemiological data was extracted from the line list and a contact map was constructed using the state line list of positive cases. We identified primary and secondary contacts for a patient from the line list. We then built a graph where each node is a positive individual and is connected by an edge with their contacts who were positive. This gives us the contact map. The graphs were then filtered by size of clusters or clusters containing a node with a particular property to focus on clusters of interest. The graphs are visualized using the d3.js 23   All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted July 11, 2020. .

Results
Karnataka recorded 1578 cases between March 5-May 21, 2020. Most of these cases were from six high burden districts, with Bangalore Urban (the district encompassing the city of Bengaluru) reporting 256 cases ( Figure 1). In total 369 (23.38%) positives were recorded at our centre, of which 60 samples were taken for sequencing ( Figure 1, Table 1). The features of positive cases in the state of Karnataka (state), and those tested and sequenced at our centre are in Table 1   (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted July 11, 2020.  Table 4). Of these 104 clusters, 38 clusters were tested at our centre and 12 of the 38 were included for sequencing. These 12 included 294 people and covered ten large clusters (>5 individuals) from the state (Supplementary Table 4).   Table 4).

6
The location of the clusters was analysed using a contact graph. Clusters (9/12) were limited to a single district excepting clusters C1, C3, C6, and C10 which were spread across districts. Time course of the 12 sequenced clusters and 11 cases with no known contacts indicated that i) lineage A.p7 was introduced recently ii) no new cases from C1 and C4 were detected (as of May 21, 2020) iii) ongoing transmission was evident for lineage B and lineage B.6. (Figure 4, Supplementary   (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted July 11, 2020. . https://doi.org/10.1101/2020.07.10.20150045 doi: medRxiv preprint COVID19 by January 30, 2020 and sustained local transmission was observed in multiple states including Delhi, Maharashtra, and Gujarat 12 . SARS-CoV-2 was first detected in the south Indian state of Karnataka on March 8, 2020 and by May 21, 2020 it had spread to 28 out of the 30 districts of the state resulting in 1578 cases. The data from this study using a combination of genomic epidemiology and contact tracing provides evidence for multiple introductions of the virus into the state, with sustained local transmission. We report the circulation of six lineages of SARS-CoV-2 in the state namely- A.p7, B, B.6, B.1, B.1.1, and B.4 (Pangolin lineage nomenclature). Viruses from both lineages are now circulating in different countries of the world 21 .
In this study, 3 of the 47 sequences, belong to lineage A.p7 and were from individuals with travel history to other states within India. This lineage is defined by two SNPs T8782C and C28144T and has been reported from Saudi Arabia, Russia, Turkey, and India 21 . No onward transmission was reported from these three cases, however they indicate continued importation of SARS-CoV-2 into the state emphasizing the need for active surveillance of domestic travel.
Of the lineages in the state, B.1 (related to GISAID clade G, and Nextstrain clade A2a) and B.1.1 (related to GISAID clade GR and Nextstrain clade A2a) are European clades. Both lineages harbour the D614G mutation on the Spike protein. It has been suggested that viruses with this mutation are more infectious and the mutation was present at higher frequency in samples across the world 16,24,25 . Of these two lineages, B.1 was a major contributor to the Italian outbreak 21 . In our study, all seven sequenced samples from a large cluster, C4 (comprising of 35 individuals) belonged to this lineage (Figure 3, Supplementary Table 4). This cluster was restricted to Bengaluru city and no new cases were reported from it between 7-21 May. The index case for this cluster was a patient with SARI. Taken together, these observations suggest a hitherto undiscovered link for this cluster to Europe. One other sequence from an individual with no known contact with a positive case in Mysuru district was also assigned to lineage B.1. However, this sequence clustered separately from all the others in the phylogeny and therefore may represent a separate introduction. 8 All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted July 11, 2020. . Using a maximum likelihood based approach we were unable to completely separate B and B.6 as some branches had sequences from both clades (Figure 2). Further, nine of twelve clusters in this study and seven of the eleven cases with no known contact (63.6%) belonged to lineages B/B.6 or both. Lineage B is one of the two clades that were circulating in China in late 2019. Lineage B.6 has earlier been reported from Philippines, UK, North America, Australia, Singapore and has also been reported from other parts of India 15 (Pangolin). The defining mutations for these lineages are similar to that of the A3i clade which has been described as a distinct phylogenetic group in India 15 .
Indeed, upto a third of the cases in multiple states across the country belong to the A3i clade 15 .
These lineages were detected throughout the period in this study and across the state including one domestic traveller who entered the state toward the end of the study period ( Figure 4). In  All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted July 11, 2020. . Some studies had initially proposed a link between viral lineages, transmission and disease phenotypes which have not been substantiated by experimental evidence 26 . The analysis of sequences obtained from symptomatic and asymptomatic (at the time of testing) individuals in this study did not reveal any association with a particular lineage. Symptomatic individuals were spread across lineages B.1, B, and B.6 along with asymptomatic individuals (Figure 2, Supplementary   Figure 2). Of the 12 clusters represented in the sequencing data, both the index case and the spreaders were more often symptomatic (Figure 3, Supplementary Table 4). However, sequencing did not reveal any mutations that were specifically associated with clinical state.
Our study had the following limitations -it is a single point analysis and some follow-up data is not available, for instance we do not know if individuals who were asymptomatic at testing later developed symptoms. Further, lineage assignments during an outbreak are dynamic and could change as more data is added and sequencing errors are accounted for. Notwithstanding these limitations, our analysis provides insights about introduction, spread, and establishment of SARS-CoV-2 in Karnataka. Further, we were able to capture both geographic diversity and obtain representation from the ten large contact clusters in the state. This was made possible by linking epidemiological information to genomic data. Integrating such an approach, in real time, into public health measures is essential for an effective outbreak response.
10 All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted July 11, 2020. . (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Figures and Legends
The copyright holder for this preprint this version posted July 11, 2020. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted July 11, 2020. . https://doi.org/10.1101/2020.07.10.20150045 doi: medRxiv preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.