Genomic epidemiology of the early stages of SARS-CoV-2 outbreak in Russia

The ongoing pandemic of SARS-CoV-2 presents novel challenges and opportunities for the use of phylogenetics to understand and control its spread. Here, we analyze the emergence of SARS-CoV-2 in Russia in March and April 2020. Combining phylogeographic analysis with travel history data, we estimate that the sampled viral diversity has originated from 67 closely timed introductions into Russia, mostly in late February to early March. All but one of these introductions came from non-Chinese sources, suggesting that border closure with China has helped delay establishment of SARS-CoV-2 in Russia. These introductions resulted in at least 9 distinct Russian lineages corresponding to domestic transmission. A notable transmission cluster corresponded to a nosocomial outbreak at the Vreden hospital in Saint Petersburg; phylodynamic analysis of this cluster reveals multiple (2-4) introductions each giving rise to a large number of cases, with a high initial effective reproduction number of 3.7 (2.5-5.0).

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Introduction
[...] 18, entrance into Russia for all non-Russian citizens for non-emergency reasons was banned 27 . While inbound flights, mainly returning Russian citizens from abroad, are still operating as of early July, passenger traffic has decreased drastically (e.g., 20-fold at the Moscow Sheremetyevo airport, the one that accepts most international flights during the pandemic 28,29 ).
Here, we report an analysis of 211 SARS-CoV-2 complete genome sequences obtained in Russia between March 11 (when there were just 28 confirmed cases Russia-wide) and April 23 (when there were 62773 confirmed cases) 30,31 . Phylogenetic analysis reveals distinct introduced lineages associated with transmission within Russia, as well as multiple individual samples phylogenetically intertwined with non-Russian sequences. The largest identified lineage corresponds to an outbreak at the Vreden hospital; phylodynamic analysis of this outbreak reveals between 2 and 4 distinct introductions and an initial rapid spread that was curbed by the subsequent establishment of quarantine.

Sampling and data acquisition
Samples were obtained from hospitals and out-patient clinics as part of surveillance and sequenced at the Smorodintsev Research Institute of Influenza. We covered [...]. Therefore, this dataset is representative of the early outbreak in Russia in terms of geographic spread (Fig. 1a). For phylogenetic context, we also used the 19623 whole-length, high-quality GISAID genomes from the rest of the world available on May 26, 2020.

medRxiv preprint doi: https://doi.org/10.1101/2020.07.14.20150979; this version posted July 17, 2020.

Multiple origins of SARS-CoV-2 in Russia
Phylogenetic analysis indicates that the Russian samples are scattered across the SARS-CoV-2 evolutionary tree, representing much of its global diversity (Supplementary Fig. 1). Most samples correspond to the B.1, B.1.1 and B.1.* lineages (PANGOLIN nomenclature 33 ) or clades G, GR and GH (GISAID nomenclature 34 ), which are widespread in Europe (Fig. 2).
We aimed to identify distinct introductions of SARS-CoV-2 into Russia. Phylogenetically, each of the 211 Russian sequences belongs to one of three categories (Supplementary Fig. 2). Firstly, 77 (36%) of these sequences form the 9 distinct Russian transmission lineages (Fig. 2, 3), defined as monophyletic groups (clades) carrying more than one sequence, all of which are Russian. These lineages indicate within-Russia transmission of introduced variants. Five of these lineages had no predating Russian sequences at their ancestral nodes, indicating that they originated from five distinct introduction events (Fig. 3a, c-d; Supplementary Note).

The remaining four Russian transmission lineages carried both non-Russian and earlier Russian sequences at their ancestral nodes (Fig. 3b). Such lineages, hereafter referred to as "stem-derived transmission lineages", could also result from distinct introduction events; alternatively, their last common ancestor could already reside in Russia. To estimate the number of introductions giving rise to the stem-derived lineages, we make use of the direct data on travel history (or lack thereof) available for a fraction of our patients. Using a statistical model, we estimate that these lineages together resulted from roughly one additional introduction event (Supplementary Note). This number could be an underestimate due to undersampling of diversity outside Russia. Indeed, one of the identified lineages (Fig. 3b) involves two samples that had travel history to two different countries, indicating a likely double introduction within the same lineage.
Secondly, we observe 73 (35%) singletons, each possessing its own characteristic mutations not shared by any other sequences (Fig. 4). These include 33 singletons without any predating Russian ancestral sequences, and 40 singletons stemming from ancestral nodes with earlier Russian sequences (hereafter, "stem-derived singletons"). We assume that the former correspond to sole introduced cases, for a total of 33 such introductions. Most of them had probably not resulted in any within-Russia transmission. However, we find that some of the singleton sequences were sampled from patients without any travel history (Fig. 4c). This indicates that at least some of the singletons likely correspond to distinct introductions that yielded domestic transmission clusters, of which just one representative was sequenced. Using travel data, we estimate that stem-derived singletons resulted from ~6 additional introduction events (Supplementary Note).
Thirdly, the remaining 61 sequences (29%) fell into 12 sets of identical sequences, hereafter referred to as stem clusters, each of which was also identical to some of the non-Russian sequences (Fig. 4d). Again, individual samples within a stem cluster could correspond to distinct introductions or to domestic transmission. Where data on travel history are available, we find that some of these clusters include multiple individuals with travel history, suggesting that identical sequences were repeatedly introduced into Russia at least in some instances (Fig. 4d). On the other hand, we also observe individuals without travel history, indicating domestic transmission of these variants. From travel data, the estimated number of introductions leading to stem cluster sequences is ~22.
Overall, we estimate the number of independent introductions into Russia as ~6 resulting in transmission lineages, ~39 resulting in singletons, and ~22 resulting in stem clusters, for a total of 67 events. The uncertainty associated with this estimate depends largely on how the introductions leading to stem clusters are counted. If each stem cluster is assumed to originate from exactly one introduction, the estimated number of introductions is 47. If instead each sequence within a stem cluster is assumed to represent a distinct introduction, the estimated number of introductions rises to 103.
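The bookkeeping behind the central estimate can be checked with a few lines of Python. This is only an illustrative tally of the numbers quoted in the text; the dictionary layout and function name are our own choices for the sketch, and it does not reproduce the statistical model from the Supplementary Note.

```python
# Illustrative tally of the counts quoted in the text (not the authors' model).

# Number of Russian sequences in each phylogenetic category.
sequences_per_category = {
    "transmission lineages": 77,
    "singletons": 73,
    "stem clusters": 61,
}

# Estimated introductions per category, as stated in the text:
# 5 lineages without earlier Russian ancestors plus ~1 stem-derived,
# 33 non-stem-derived singletons plus ~6 stem-derived, and ~22 for stem clusters.
introductions_per_category = {
    "transmission lineages": 5 + 1,
    "singletons": 33 + 6,
    "stem clusters": 22,
}


def percentage_shares(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


total_sequences = sum(sequences_per_category.values())
total_introductions = sum(introductions_per_category.values())

if __name__ == "__main__":
    print("sequences:", total_sequences)
    print("introductions:", total_introductions)
    for name, share in percentage_shares(sequences_per_category).items():
        print(f"{name}: {share:.1f}% of sequences")
```

The totals recover the 211 sequences and 67 introductions given in the text, which corresponds to roughly one inferred introduction per three sampled sequences.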
Phylogenetic analysis indicates that for most Russian transmission lineages, the earliest sample collection dates fall into the range between March 11 and 24, indicating that the corresponding lineages were introduced into Russia not long before (Fig. 5b). Indeed, out of the nine Russian transmission lineages, only two (lineages 6 and 8) had later dates of the oldest sequences. However, those were stem-derived lineages, and the oldest stem sequences corresponding to them dated to March 13, suggesting that these transmission lineages could have also been established by this date. Many (15 out of 33) of the singletons were also collected within this timeframe, although some were collected later (mean date: March 29); together with the fact that many of the patients carrying singleton sequences had not travelled (Fig. 4, Supplementary Fig. 4), this indicates that they in fact correspond to as yet unsampled transmission lineages. By contrast, most stem-derived singletons were sampled at later dates (mean date: April 7, Mann-Whitney U-test, p=0.014), indicating that they were more likely than non-stem-derived singletons to originate from within-Russia transmission.
By the time introductions into Russia started, the virus had already spread through other countries, with the same variant frequently present at multiple locations. Therefore, the source of most introductions could not be established unambiguously. Still, for many of the samples, phylogenetic position is informative of the source. For example, the earliest patient with known travel history returned to Russia from France, and her sample is nested within a clade with just French and Danish sequences at the ancestral node, with the French sequences having earlier dates and France therefore being the more plausible source (Fig. 4a). For two additional sequences corresponding to regional outbreaks, no direct travel data was available but the probable source could be established from media reports, and was consistent with the phylogenetic position of the corresponding clades. This was the case for the import of clades from Switzerland into Yakutia (the Sakha Republic) (Fig. 3c) 35,36 and from Saudi Arabia to the Chechen Republic (Fig. 4b) 37 .
Overall, out of the 13 patients with known travel history (11 direct + 2 from media reports), the country of origin is consistent with the sampling locations of the same or ancestral nodes in 9 cases, including the 3 cases when it is uniquely identified. In one case (Supplementary Fig. 5b), the travel direction (Egypt) is inconsistent with the phylogenetic position of the sample, and in the remaining three cases, there is not enough phylogeographic data to make a call. For the same 9 out of the 13 patients, we were able to correctly and uniquely identify the source continent (Europe in all cases).
The individuals importing the virus and seeding the Russian transmission lineages were not a random sample of the population. Very early samples were collected from patients who were on average younger than those sampled later (Fig. 5a). This is consistent with the major role of younger Russians in the import of virus into Russia 38 , possibly because they comprised a larger share among the people returning from business trips or holidays.
No non-Russian sequences were nested within predominantly Russian clades (Figs. 2, 3, Supplementary Fig. 1). Therefore, we observe no sign of export of SARS-CoV-2 outside of Russia.

Figure caption (fragment): [...] no Russian sequences are truncated, with numbers of such sequences shown in brackets. Sequences from the Vreden hospital and lineages carrying such sequences are marked with a star, triangles and diamonds. Branch lengths represent the number of nucleotide substitutions. "hCoV-19/" prefixes are excluded from all sample names for clarity.


Figure caption (fragment): [...] The ancestral node of lineage 9 uniquely maps to the USA, and this lineage spans two different regions of Russia. Flags represent individuals with a known history of travel to the corresponding country; the Russian flag shows a known lack of travel history. Diamonds represent samples associated with group 3 of the Vreden hospital outbreak. See Supplementary Fig. 3 for all nine transmission lineages.


Temporal dynamics of SARS-CoV-2 spread in Russia
Following introduction, the virus has spread throughout Russia. Four out of the 9 identified Russian transmission lineages, and 8 out of the 12 stem clusters, span multiple regions (Figs. 1b, 3c-d, 4d). As Moscow and Saint Petersburg are major transport hubs, together responsible for 77% of the international air traffic in Russia, we hypothesized that the virus was introduced through these cities, and spread throughout Russia from them. Contrary to this hypothesis, among the Russian stem clusters and singletons, the samples from Moscow or Saint Petersburg do not sit on shorter branches than samples from other regions; in fact, branches leading to them tend to be slightly longer (0.88 vs. 0.37, p=0.006, permutation test), probably because of more extensive regional sampling early in the outbreak. Thus, we see no evidence for a preferential direction of transmission within Russia, suggesting that the Russian epidemic has been seeded by near-concurrent introduction into multiple regions.
Our dataset contains SARS-CoV-2 genomes obtained from 52 of the Vreden hospital patients or medical workers. Phylogenetic analysis indicates that these samples form three distinct groups, each defined by its own set of mutations. The largest group, group 1, includes 41 sequences obtained between April 3 and April 22 and represents a distinct Russian transmission lineage (lineage 8, star in Fig. 2). This lineage derives from a very prolific ancestral node which has seeded multiple lineages throughout the world, including five of the Russian transmission lineages, so its origin cannot be positioned phylogeographically. Group 2 contains 7 out of 9 sequences in another clade, which also carries one non-Russian (English) sequence (triangles in Fig. 2). Finally, group 3 includes 4 sequences and represents a clade of its own within another Russian transmission lineage.
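The branch-length comparison between samples from Moscow or Saint Petersburg and samples from other regions, reported earlier in this section, relies on a permutation test. The sketch below shows one common form of such a test (two-sided, on the difference in group means) using only the Python standard library. The exact test statistic used in the study is not specified in the text, and the branch lengths in the example are hypothetical values chosen for illustration, not data from the study.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Group labels are shuffled repeatedly; the p-value is the share of
    shuffles whose absolute mean difference is at least as large as the
    observed one (with an add-one correction so that p is never zero).
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical branch lengths (number of substitutions); NOT study data.
hub_branches = [1, 0, 2, 1, 1, 0, 2, 1]        # e.g. Moscow / Saint Petersburg
regional_branches = [0, 0, 1, 0, 1, 0, 0, 1]   # e.g. other regions

observed_diff, p_value = permutation_test(hub_branches, regional_branches)

if __name__ == "__main__":
    print(f"difference in mean branch length: {observed_diff:.3f}")
    print(f"permutation p-value: {p_value:.4f}")
```

A permutation test makes no distributional assumptions about branch lengths, which is convenient for small integer-valued counts of substitutions.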

We found that the Bayesian analysis supports at least two distinct introductions of SARS-CoV-2 into the Vreden hospital. This is based on the deep split between group 3 and groups 1-2. The MRCA of all three groups dates to February 4 (95% CI January 1 to March 7). This is almost two months prior to the assumed date of introduction (March 27), implying that group 3 and the remaining Vreden samples were introduced independently.
A third introduction into the Vreden hospital is also highly probable. Indeed, the MRCA of groups 1 and 2 dates to March 15 (95% CI February 25 to March 31). As there was no sign of infection at the hospital before the end of March, it is quite likely that these two groups originated through separate introductions. The root of group 1 dates to March 23 (95% CI March 11 to March 30), which is consistent with the suspected illness period of patient zero. Additional evidence that groups 1 and 2 originate from distinct introductions is provided by the fact that the clade that includes group 2 also carries a non-Russian (English) sequence (Fig. 2).
Finally, a fourth introduction is also possible, suggested by the deep split within group 3 (Fig. 7). The MRCA of group 3 dates to March 17, although this estimate has a broad confidence interval which overlaps the supposed illness period of patient zero.
We estimated the phylodynamic parameters before and after the quarantine measures were introduced. In all three analyses, the estimates of phylodynamic parameters were stable and consistent with each other. We found that the effective reproductive number Re was 3.72 (95% CI 2.48-5.05) before April 8, and dropped to 1.38 (95% CI 0.48-2.41) after April 8 (Fig. 6). Remarkably, the credible intervals for these estimates do not overlap, indicating a statistically significant slowdown of the outbreak. The sampling proportion decreased with time, indicating that the number of sequenced samples did not keep up with the rapidly growing case counts (Fig. 6, Supplementary Table 3).
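The non-overlap statement can be verified directly from the two intervals quoted above. The snippet is a minimal check, with a helper function name of our own choosing; it only compares interval endpoints and is not a substitute for the phylodynamic analysis itself.

```python
def intervals_overlap(a, b):
    """Return True if the closed intervals a=(low, high) and b=(low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# 95% intervals for Re quoted in the text.
re_before_april_8 = (2.48, 5.05)
re_after_april_8 = (0.48, 2.41)

overlap = intervals_overlap(re_before_april_8, re_after_april_8)

if __name__ == "__main__":
    print("intervals overlap:", overlap)
```

Non-overlapping 95% intervals are a conservative criterion: two estimates can differ credibly even when their intervals overlap slightly, so the absence of overlap here is a fairly strong signal.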

By using direct travel data, we show that both these problems hold. Indeed, we find transmission lineages apparently co-introduced from multiple countries (Fig. 3b) or singletons without any history of travel (Fig. 4c). The uncertainty in the number of introduction events is the highest for identical sequences with broad geographic distribution, e.g., the last common ancestor of lineage B.1.1. This node constitutes a stem cluster of 100 identical Russian sequences, as well as 4323 sequences from outside Russia. It is the immediate ancestor to five Russian transmission lineages and 19 stem-derived Russian singletons, so how many times this sequence has been introduced into Russia strongly affects the overall count of introductions. Travel data indicate that a stem cluster can carry a combination of multiple introduced and domestically transmitted sequences, complicating the inference of the number of introductions.
Under a simple statistical model combining genetic and available travel data, we estimate that the sampled diversity of SARS-CoV-2 in Russia originated from 67 introductions. Since this corresponds to roughly one introduction per three sequences sampled, the actual number of introductions was probably much higher, and its estimate will likely increase as more sequences are sampled.
Contrary to some previous reports 47, 53 , we find that the phylogeographic position of a lineage is predictive of its origin. For some of the Russian samples, we are able to infer their origin unambiguously or with little ambiguity, and when direct travel data is available, it supports our claims. The difference from the UK study may be due to the fact that most Russian lineages originated late, when the source European lineages were already well established. Still, for many of the introductions, phylogeography is not informative of their origin; moreover, phylogeographic inferences are expected to be biased in the presence of uneven sampling between countries. This illustrates the need to combine multiple data sources, including direct travel data, for understanding viral origin and spread 54 .
Although it is hard to ascribe epidemiological results to specific non-pharmaceutical interventions (NPIs), our analysis suggests that the border closure with China implemented in February has effectively curbed the introduction of the virus into Russia from Asia. Indeed, only four of our samples belong to lineages A, B and B.2 (GISAID clades S, L and V, respectively), which predominantly originated in Asia; and two of those sequences are nested within other European subclades, indicating that the import was through Europe. This fraction is not representative of global case counts at that time, and instead reflects travel patterns and the history of border closures. It is also in contrast to the situation in other countries where the outbreaks started earlier, and were probably seeded by direct introduction from Asia 55-57 .
As a result, the major source of the SARS-CoV-2 introductions into Russia was Europe. We see robust evidence for transmission within Russia, but no evidence of export of SARS-CoV-2 outside Russia. This is consistent with a lower number of international travelers leaving Russia compared to the number returning to Russia after the start of the pandemic; a later start of the outbreak compared to that in neighboring countries; and/or higher efficiency of border closure at later stages of the outbreak. More generally, this is consistent with Europe being a global source, rather than a sink, of infections up to early March 58 .
For most of the discovered transmission lineages, the earliest sampled sequence was collected between March 11 and 24 (Fig. 5b). In the larger UK dataset, the mean time between the importation date of a lineage and its earliest sampling date within the UK was estimated to be approximately two weeks, although this depends on many factors including lineage size and sampling intensity 47 . If this can be extrapolated to Russian transmission lineages, this implies that these lineages typically originated from imports in the last week of [...] 64 .
In all but one of these cases, the outbreaks were genetically homogeneous, indicating that they each arose from a single case. In the community living facility, multiple introductions have occurred, but there was a dominant clade that included nearly all the samples, while other clades were rare 46 . By contrast, at the Vreden hospital outbreak, we observe multiple (2-4) introductions, each of which gave rise to a prolific clade. This indicates that this outbreak could have originated from multiple superspreading events. Furthermore, we estimate the initial effective reproductive number Re during the pre-quarantine period at ~3.7, which is rather high. Multiple superspreading events and the high Re can be due to some of the conditions specific to a hospital not specifically equipped for infection control, including dense contacts (in particular, spread by medical workers), absence of protective measures, and lack of awareness. In the second phase of the outbreak, we observe a significant decrease in Re down to ~1.4. This change can be explained by two factors. Firstly, it can be due to increased awareness and quarantine measures which were in effect after April 7. Secondly, it can be due to a large number of people already ill, preventing further infection; indeed, around 30% of people at the hospital had been infected by April 22. We cannot quantify the contribution of these factors to the slowing rate of infection spread with available data and methods.
The Vreden hospital outbreak has apparently contributed to COVID-19 spread in Saint Petersburg. By May 26, more than 14,000 cases had been confirmed in Saint Petersburg. In our dataset, the three predominantly-Vreden groups include five non-Vreden samples (out of the 84 non-Vreden Saint Petersburg samples; Fig. 2 and Supplementary Fig. 3), indicating transmission of SARS-CoV-2 from the hospital to the outside population. For two of these samples, 6840 and 6846, their relation to the Vreden hospital could be identified: these two cases were probably infected by family members of a Vreden hospital employee. As we did not specifically target Vreden-related samples collected outside of the Vreden hospital, the high proportion of Vreden-related samples (5/84) suggests that the Vreden hospital outbreak may have contributed to a substantial fraction of non-Vreden cases in Saint Petersburg.

Sample collection and sequencing
Nasopharyngeal and/or throat swabs were collected in virus transport media. Total RNA was extracted using the RiboPrep DNA/RNA extraction kit (AmpliSens, Russia). Extracted RNA was immediately tested for SARS-CoV-2 using LightMix® SarbecoV E-gene plus EAV control (TIB Molbiol, Berlin, Germany) provided by the WHO Regional Office for Europe and based [...] supplemented with GlutaMax (Gibco), Sodium Pyruvate (Gibco) and 10% FBS (Gibco #10500). Two days before inoculation, cells were seeded in 5.5 cm² cell culture tubes (Nunc) at a 1:4 ratio and 5% FBS. Samples were diluted 1:10 with serum-free media containing antibiotic-antimycotic (Gibco) and inoculated onto cells in a volume of 0.5 ml/tube. After incubation for 2 h at 37 °C, the inoculum was removed and 3 ml of serum-free media with antibiotic-antimycotic was added to the tubes. Viruses were harvested 4-6 days post inoculation (p.i.) when the cytopathic effect (CPE) was near 80-100%, while the first signs of CPE were typically observed 2-4 days [...]

For each sample, we then mapped reads onto the Wuhan-Hu-1 SARS-CoV-2 genome sequence (NCBI ID: MN908947.3) using minimap2 69 with default settings and filtered out chimeric reads and reads that had secondary alignments. SAMtools mpileup 70 was used to produce draft consensus sequences, which were then corrected as follows. Mappings were converted into .tsv files using sam2tsv 71 , and for each position in the genome, we computed the frequencies of all variants present. We further considered positions with coverage of 15 or higher and an alternative (compared to Wuhan-Hu-1) variant frequency of 50% or higher. We corrected the draft consensus sequences based on the defined set of alternative variants. Each introduced correction was assessed by visually analyzing the corresponding region of mapped reads in IGV 72 . Additionally, we manually assessed all alternative variants that had coverage of 100 or below. We observed several spurious mutations that were not included in the final consensus sequences, including the homoplasic mutation G11083T residing at the end of the poly-T tract in the genome, which was observed in five of our samples.
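The coverage and frequency thresholds described above can be illustrated with a small filter. This is a minimal sketch, not the pipeline used in the study: the input layout (a dictionary of observed bases per position), the function name, and the choice to consider only the most frequent non-reference base at each position are all assumptions made for the example, and the positions and reads are hypothetical.

```python
from collections import Counter

MIN_COVERAGE = 15          # minimum number of reads covering a position (from the text)
MIN_ALT_FREQUENCY = 0.5    # minimum frequency of the alternative variant (from the text)


def call_alternative_variants(observed_bases, reference):
    """Select positions where a non-reference base passes both thresholds.

    observed_bases: dict mapping position -> list of bases seen in the reads
                    (hypothetical layout for this sketch).
    reference:      dict mapping position -> reference base.
    Returns a dict mapping position -> alternative base to place in the consensus.
    """
    calls = {}
    for position, bases in observed_bases.items():
        coverage = len(bases)
        if coverage < MIN_COVERAGE:
            continue  # too few reads to trust any call at this position
        alt_counts = Counter(b for b in bases if b != reference[position])
        if not alt_counts:
            continue  # every read matches the reference
        # Assumption: only the most frequent alternative base is considered.
        alt_base, alt_count = alt_counts.most_common(1)[0]
        if alt_count / coverage >= MIN_ALT_FREQUENCY:
            calls[position] = alt_base
    return calls


# Hypothetical toy data; positions and read counts are invented for illustration.
reference = {10: "C", 20: "G", 30: "A"}
observed_bases = {
    10: ["T"] * 18 + ["C"] * 2,   # coverage 20, alternative frequency 0.90 -> called
    20: ["T"] * 4 + ["G"] * 16,   # coverage 20, alternative frequency 0.20 -> not called
    30: ["T"] * 10,               # coverage 10 -> below the coverage threshold
}

variant_calls = call_alternative_variants(observed_bases, reference)

if __name__ == "__main__":
    print(variant_calls)
```

In the workflow described in the text, calls of this kind were additionally inspected by eye in IGV, a step that a simple threshold filter cannot replace.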

Phylogenetic analysis
We categorized each Russian sequence into one of five phylogenetic categories based on its phylogenetic position, defined as follows. A Russian transmission lineage is a set of two or more sequences that form a Russian-only clade. A Russian singleton is a single Russian sequence that forms a clade of its own (i.e., possesses one or more private mutations) and is not a part of a Russian transmission lineage. A Russian stem cluster is a set of Russian sequences identical to each other and to some non-Russian sequences. A Russian stem-derived transmission lineage is a Russian transmission lineage whose immediate ancestor is a Russian stem cluster, where at least one of the sequences in this stem cluster was collected earlier than the earliest sequence in the lineage. A Russian stem-derived singleton is a Russian singleton whose immediate ancestor is a Russian stem cluster, where at least one of the sequences in this stem cluster was collected earlier than the singleton. These categories are shown schematically in Supplementary Figure 2.
As sampling dates, we used the collection dates reported in GISAID. For some samples, either the day or both the day and the month of collection were missing. In such cases, the date was set to the latest date possible, e.g., "2020-03" to "2020-03-31" and "2020" to "2020-12-31". The date of a lineage introduction was estimated as the earliest collection date among all samples in that lineage.
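The date handling described above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that dates arrive as "YYYY", "YYYY-MM" or "YYYY-MM-DD" strings.

```python
import calendar
from datetime import date

def impute_latest_date(raw):
    """Resolve a partial collection date to the latest date it could denote."""
    parts = [int(p) for p in raw.split("-")]
    if len(parts) == 3:
        return date(*parts)
    if len(parts) == 2:
        year, month = parts
        # monthrange returns (weekday of first day, number of days in month)
        return date(year, month, calendar.monthrange(year, month)[1])
    if len(parts) == 1:
        return date(parts[0], 12, 31)
    raise ValueError(f"unrecognized date format: {raw!r}")

def lineage_introduction_date(raw_dates):
    """Earliest (imputed) collection date among all samples of one lineage."""
    return min(impute_latest_date(d) for d in raw_dates)

# Toy lineage with one complete and two partial collection dates.
toy_introduction = lineage_introduction_date(["2020-03", "2020-03-15", "2020"])
```

Setting incomplete dates to the latest possible day is conservative here: a partially dated sample cannot make a lineage appear to have been introduced earlier than its fully dated samples indicate.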
In phylogeographic analysis, we assumed that the possible source(s) of introduction for Russian transmission lineages and Russian singletons were the sampling country(ies) of the non-Russian sequences ancestral to the considered lineage; for Russian stem-derived transmission lineages and Russian stem-derived singletons, these were the sampling countries of the non-Russian sequences identical to the ancestral stem cluster. For stem clusters, the possible sources of introduction were the countries of origin of the sequences identical to those in the stem cluster. As a possible source of introduction, we only considered countries whose earliest collection date preceded the earliest collection date among all samples in the lineage (or the collection date of the singleton). For patients with known travel history, we considered the country (continent) of origin as uniquely identified by phylogeography if it was either the only country (continent) on the ancestral stem, or the one with the earliest collection date. For samples with no travel data, the same logic was applied, except that we only inferred the country or continent of origin if it was the only one on the stem. If there were more than eight countries in the list, countries were merged into regions: Africa, Asia, Europe, North America and South America. To study the possibility of introduction from China, we performed the same analysis, but considered China, Hong Kong and Taiwan separately from the rest of Asia.
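The source-selection rules in this paragraph can be summarized in a short Python sketch. All names are hypothetical, the country-to-region mapping is supplied by the caller, and returning no origin when two countries tie for the earliest date is an assumption that the text does not address.

```python
from datetime import date

MAX_COUNTRIES = 8  # longer country lists are merged into regions

def candidate_sources(earliest_by_country, lineage_earliest):
    """Countries sampled strictly before the earliest sample of the lineage."""
    return sorted(
        country
        for country, first_seen in earliest_by_country.items()
        if first_seen < lineage_earliest
    )

def unique_origin(earliest_by_country, lineage_earliest, has_travel_history):
    """Return the uniquely identified source country, or None."""
    candidates = candidate_sources(earliest_by_country, lineage_earliest)
    if len(candidates) == 1:
        return candidates[0]
    if has_travel_history and candidates:
        # With travel data, the earliest-sampled candidate is also accepted.
        first_date = min(earliest_by_country[c] for c in candidates)
        first = [c for c in candidates if earliest_by_country[c] == first_date]
        if len(first) == 1:  # assumption: ties are left unresolved
            return first[0]
    return None

def merge_into_regions(countries, region_of):
    """Collapse a list of more than eight countries into regions."""
    if len(countries) <= MAX_COUNTRIES:
        return sorted(countries)
    return sorted({region_of[c] for c in countries})

# Toy stem: France was sampled after the lineage, so it is not a candidate.
toy_stem = {
    "Italy": date(2020, 2, 20),
    "Spain": date(2020, 3, 1),
    "France": date(2020, 3, 20),
}
toy_lineage_start = date(2020, 3, 10)
```

For the toy stem, a patient with travel history would be assigned Italy (the earliest of the two remaining candidates), whereas a sample without travel data would be left unassigned because two candidate countries remain.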
To estimate the number of possible exports out of Russia, we used maximum parsimony, asking whether any non-Russian sequences were nested within Russian clades. In this analysis, we conservatively assumed that the phylogenetic nodes carrying both Russian and non-Russian sequences were positioned outside Russia. No evidence for export was observed.
To understand whether introductions to Russia occurred through major transportation hubs (Moscow and Saint Petersburg), we considered all Russian samples not included in Russian transmission lineages. For these samples, we calculated the branch length from each sample to its immediate ancestor and assigned each sample to one of two categories: major hubs (Moscow, Moscow region, Saint Petersburg and Leningrad region) or other locations (all other locations in Russia). On these data, we performed a permutation test, shuffling the labels across the dataset 1000 times, and calculated the two-sided p-value.
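A label-shuffling test of this kind can be sketched as follows. The text does not state the test statistic, so the difference in mean branch length between the two categories is an assumption, as are the function name, the fixed random seed and the add-one correction in the p-value.

```python
import random
from statistics import mean

def permutation_test(branch_lengths, is_hub, n_permutations=1000, seed=0):
    """Two-sided permutation test on the difference in mean branch length.

    branch_lengths: one value per sample.
    is_hub: parallel list of booleans (True for major hubs); both groups
    must be non-empty.
    """
    rng = random.Random(seed)

    def mean_difference(labels):
        hubs = [v for v, hub in zip(branch_lengths, labels) if hub]
        others = [v for v, hub in zip(branch_lengths, labels) if not hub]
        return mean(hubs) - mean(others)

    observed = mean_difference(is_hub)
    shuffled = list(is_hub)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # reassign labels, keeping group sizes fixed
        if abs(mean_difference(shuffled)) >= abs(observed):
            extreme += 1
    # Add-one correction so that the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Toy data: hub samples sit on the stem, other samples carry private mutations.
toy_lengths = [0, 0, 0, 0, 0, 5, 5, 5, 5, 5]
toy_labels = [True] * 5 + [False] * 5
toy_observed, toy_p = permutation_test(toy_lengths, toy_labels)
```

Taking the absolute value of the statistic makes the test two-sided: a shuffled dataset counts as extreme whether the hubs show longer or shorter branches than the other locations.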

Phylodynamics of SARS-CoV-2 in Vreden hospital
As discussed in the Results, the Vreden samples belong to three distinct phylogenetic groups. To account for this, we constrained the phylogeny as follows: (((group 1), group 2), (group 3)). We independently ran BEAST2 on three datasets: (i) the whole Vreden dataset comprising groups 1, 2 and 3; (ii) groups 1 and 2; and (iii) group 1 only. The model details were as follows. The effective reproductive number was allowed to change on March 27 (which delimits the suspected out-of-hospital period) and again on April 8 (which corresponds to the introduction of quarantine). The sampling proportion was set to zero prior to March 27, and was allowed to change on April 8 and April 15. The prior on the clock rate was set to a normal distribution with mean 9.41 × 10⁻⁴ and standard deviation 4.99 × 10⁻⁵, based on the estimates from the UK study 47. Dates were fixed for the 11 sequences (all from group 1) for which the dates of symptom onset were available; for the remaining 41 sequences, we applied tip date sampling with a uniform prior as detailed below. Other priors are provided in Supplementary Table 7. Supplementary Tables 4, 5 and 6 contain the Bayesian estimates of the model parameters for the three datasets comprising groups 1, 2 and 3, groups 1 and 2, and group 1, respectively.
The 52 Vreden samples were collected on 5 distinct dates, with a substantial lag between consecutive collection dates (see Supplementary Table 3). For some of the samples, the collection date could differ substantially from the symptom onset date. To address the bias in collection dates, we used the symptom onset date instead of the collection date in the BEAST2 analysis, estimating it as follows. For 11 samples, the symptom onset dates were known (Supplementary Table 2). For each of the remaining 41 samples, we produced a posterior estimate of its symptom onset date by using a uniform prior between March 31 and the collection date.

Public information and data visualization
The initial Russian map was downloaded from GADM (sf, level 1) 76. Numbers of confirmed cases in Russia by region were downloaded on May 26, 2020 from ref. 77. Patient age data for Russian samples were extracted from GISAID metadata; for 12 samples, age data were missing. The Spearman correlation between age and collection date was calculated in R.
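The correlation itself was computed in R; the following Python sketch only illustrates what the statistic measures. It assumes that tied values receive average ranks and that collection dates are converted to day ordinals; all names are hypothetical.

```python
from datetime import date

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data: age increases with collection date, so the correlation is 1.
toy_ages = [30, 45, 60]
toy_dates = [date(2020, 3, 20), date(2020, 3, 25), date(2020, 4, 1)]
toy_rho = spearman(toy_ages, [d.toordinal() for d in toy_dates])
```

Samples with missing age data (12 in this dataset) would have to be excluded before ranking.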

Code availability
Python/R scripts used for data visualization are available upon request.
11. In Russian. Принято решение о временном ограничении въезда иностранных
medRxiv preprint, this version posted July 17, 2020. doi: https://doi.org/10.1101/2020.07.14.20150979