Three adjacent nucleotide changes spanning two residues in SARS-CoV-2 nucleoprotein: possible homologous recombination from the transcription-regulating sequence
Running title: New motif within SARS-CoV-2 nucleocapsid

The COVID-19 pandemic is caused by the single-stranded RNA virus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a virus of zoonotic origin that was first detected in Wuhan, China in December 2019. There is evidence that homologous recombination contributed to this cross-species transmission. Since that time the virus has demonstrated a high propensity for human-to-human transmission. Here we report two newly identified adjacent amino acid polymorphisms in the nucleocapsid at positions 203 and 204 (R203K/G204R) due to three adjacent nucleotide changes across the two codons (i.e. AGG GGA to AAA CGA). This new strain within the LGG clade may have arisen by a form of homologous recombination from the core sequence (CS-B) of the transcription-regulating sequences of SARS-CoV-2 itself and rapidly increased to approximately one third of reported sequences from Europe during March 2020. We note that these polymorphisms are predicted to reduce the binding of an overlying putative HLA-C*07-restricted epitope and that HLA-C*07 is prevalent in Caucasians, being carried by >40% of the population. The findings suggest that homologous recombination may have occurred since the introduction of the virus into humans and may be a mechanism for increased viral fitness and adaptation of SARS-CoV-2 to human populations.


Background
Evidence that SARS-CoV-2 adapts to selective pressures as it spreads among diverse human populations has implications for ongoing changes in viral fitness over time, which in turn may affect transmissibility, disease pathogenesis and immunogenicity. Geographic differences in viral sequence diversity and epidemiological profiles of disease are likely to reflect the spread of founder viruses that first entered different SARS-CoV-2-naïve populations. However, the extent to which selection pressures operating within those populations also shape SARS-CoV-2 diversity is currently not known. Functional effects of new genetic changes need to be considered in ongoing public health measures to contain infection around the world and in the development of universal vaccines and antiviral therapy. Here we describe a newly emerging strain of SARS-CoV-2 within the LGG clade that appears to be the result of a homologous recombination event that introduced three adjacent nucleotide changes spanning two residues of the nucleocapsid protein. That strain expanded rapidly in Europe in March 2020. The nucleocapsid protein forms an integral part of the virus life cycle and is known to be highly immunogenic.

Newly emerging strain from Europe with linked variations in the nucleocapsid
We utilized publicly available SARS-CoV-2 sequences from the GISAID database  Table 1). Of these polymorphisms, three (L84S in ORF8, D614G in the surface glycoprotein (S) and G251V in NS3 (ORF3a)) mark the major worldwide clades S, G and V, respectively. Two newly identified adjacent polymorphisms (R203K and G204R) in the nucleocapsid protein occur in approximately 13.4% of deposited strains and form one of the main strains emerging from Europe (Figure 1A). Other common polymorphisms include Q57H in NS3, T85I in NSP2, L37F in NSP6, P323L in the RNA-dependent RNA polymerase, T175M in the membrane glycoprotein and P504L and Y541C in the helicase. Polymorphisms currently at low frequency (<5% of deposited SARS-CoV-2 sequences) include S193I in the nucleocapsid, H93Y in NS3, and V378I, G392D, I739V, P765S, A876T, F3071Y, G3278S and K3353R in ORF1ab (Supplementary Table 1).
The polymorphisms are present in strains sequenced using different next generation sequencing (NGS) platforms (e.g. nanopore, Illumina) and the Sanger-based sequencing method, making it unlikely that the new changes are sequencing or alignment errors. In addition, different laboratories around the world have deposited sequences with these polymorphisms in the database, and examination of individual sequences in the region does not reveal obvious insertions/deletions that would suggest alignment issues or homopolymer slippage.

Adjacent amino acid polymorphisms due to three adjacent nucleotide changes in the nucleocapsid
For the two newly identified adjacent polymorphisms in the nucleocapsid at positions 203 and 204, there were no strains in the database that had only one of the two changes. The SARS-CoV-2 sequences deposited into the GISAID database are consensus strains predominantly generated from NGS platforms that can typically identify low frequency variants. We did not have access to the original sequence files from the contributing laboratories in order to assess if there was evidence of strains that harbored only one of the polymorphisms at lower frequencies. However, no circulating strain has so far been captured that contains only one of the two nucleocapsid polymorphisms as the consensus sequence.
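The linkage check described above can be sketched as a simple tally of the residue pair at positions 203 and 204 across aligned nucleocapsid protein sequences. This is a minimal illustration, not the pipeline used for this analysis: the function names and the toy input are hypothetical, and real input would be nucleocapsid translations of the deposited consensus genomes.

```python
from collections import Counter


def tally_residue_pairs(protein_seqs, pos_a=203, pos_b=204):
    """Count the residue pair found at two 1-based positions in each sequence."""
    counts = Counter()
    for seq in protein_seqs:
        if len(seq) < max(pos_a, pos_b):
            continue  # skip sequences too short to cover both positions
        counts[seq[pos_a - 1] + seq[pos_b - 1]] += 1
    return counts


def toy_sequence(pair, length=210):
    """Hypothetical placeholder sequence: 'X' everywhere except positions 203/204."""
    residues = ["X"] * length
    residues[202], residues[203] = pair[0], pair[1]
    return "".join(residues)


# Toy data only: five wild-type (RG) and two variant (KR) sequences.
toy = [toy_sequence("RG")] * 5 + [toy_sequence("KR")] * 2
counts = tally_residue_pairs(toy)

# Sequences carrying only one of the two changes would appear as KG or RR.
single_change = counts["KG"] + counts["RR"]
print(dict(counts), "single-change sequences:", single_change)
```

A non-zero count for KG or RR in real data would indicate a consensus sequence carrying only one of the two polymorphisms.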
The rapid emergence of these closely linked polymorphisms may reflect strong selection pressure on this region of the genome, in which the original mutation incurred a replicative capacity or other fitness cost that could be restored by a linked compensatory mutation. Such adaptations with closely linked compensatory mutations are known to occur under host immune pressure, as is well established for other adaptable RNA viruses such as HIV 1,2 and hepatitis C virus (HCV) 3. These viruses have such a high rate of replication and such error-prone polymerases (reverse transcriptase in the case of HIV) that a massive swarm of viral variants with ongoing recombination between residues is generated continuously. As a result, selection pressure exerted by immune responses or other factors effectively operates on each residue independently 4. In contrast, coronaviruses encode proofreading machinery and have a propensity to adapt by homologous recombination between viruses rather than by classic step-wise individual mutations driven by selective pressures operating on single viral residues. This, together with the routine nature of their cross-species transmission 5, led Graham and Baric 6 to presciently warn in 2010 that it was a matter of when, rather than if, a pathogenic coronavirus pandemic would occur in humans. Also of note, the phenomenon of compensatory fixation has been described in the area of HIV antiviral resistance, in which linked mutations cannot revert to wild type when the selective pressure is removed because the virus cannot negotiate the fitness valley to return to its previous optimal state 7. We therefore predict that the K203/R204 (AAA CGA) change is likely to remain fixed and that intermediates to the wild type are unlikely to be found.
It will be critical to determine whether the introduction of the AAACGA motif incurs a replicative or other fitness cost to the virus, creates an alternative subgenomic mRNA transcript or RNA secondary structure, or increases nucleocapsid activity, as this could indicate viral attenuation as the virus passes globally through populations of diverse immunogenetic background.
As further evidence of the likelihood of a homologous recombination event, the R203K polymorphism alone requires two nucleotide changes (AGG to AAA). However, strikingly, the position shows no evidence to date of alternative codon usage: all viral strains that contain an R at this position have the AGG codon, and similarly those As of March 31, 2020 there appears to be only a small proportion of strains with KR-LGG in the US, likely reflecting that deposited sequences have been mainly from the West coast of the US, which experienced initial importation of Asian strains of SARS-CoV-2. It will be of great interest to see sequences from the East coast of the US given the early importation of SARS-CoV-2 from Northern Europe as well as Asia and the widespread community transmission that has followed (Figure 1A).
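To make concrete what "intermediates to the wild type" would look like, the sketch below enumerates every sequence that carries some but not all of the three nucleotide changes between AGG GGA and AAA CGA, and translates each one. It is an illustration only: the codon table is a small subset of the standard genetic code covering just the codons reachable here, and the function names are hypothetical.

```python
from itertools import product

# Subset of the standard genetic code: only the codons that can arise here.
CODONS = {"AGG": "R", "AGA": "R", "AAG": "K", "AAA": "K", "GGA": "G", "CGA": "R"}

WILD_TYPE = "AGGGGA"  # codons 203-204 in the wild type (R, G)
VARIANT = "AAACGA"    # codons 203-204 in the new strain (K, R)


def translate(nt):
    """Translate a nucleotide string whose length is a multiple of three."""
    return "".join(CODONS[nt[i:i + 3]] for i in range(0, len(nt), 3))


def differing_positions(a, b):
    """Return 0-based indices at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def intermediates(a, b):
    """List sequences carrying some, but not all, of the changes from a to b."""
    diffs = differing_positions(a, b)
    found = []
    for choice in product([False, True], repeat=len(diffs)):
        if all(choice) or not any(choice):
            continue  # skip the two endpoints themselves
        seq = list(a)
        for pos, take in zip(diffs, choice):
            if take:
                seq[pos] = b[pos]
        found.append("".join(seq))
    return found


diffs = differing_positions(WILD_TYPE, VARIANT)
states = intermediates(WILD_TYPE, VARIANT)
for s in states:
    print(s[:3], s[3:], "->", translate(s))
```

With three differing nucleotides there are six possible intermediates. Under the standard genetic code, two encode K/G, two encode R/R, one (AGA GGA) is a synonymous R/G variant and one (AAG CGA) already encodes K/R.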
Interestingly, the M175 polymorphism in the membrane glycoprotein appears to only be present on the KR-LGG combination (of the 132 sequences with this polymorphism 131 are from Europe and 1 from North America) (Supplementary Table 2). When the other common polymorphisms (>5%) observed in the NSP2, NSP6, RNA-dependent RNA polymerase (RdRP), membrane glycoprotein and helicase are taken into account, there are at present eight main circulating strains at >5% frequency in the database all within one to three amino acid polymorphism networks (Supplementary Table 2).
Of note, our current knowledge of the global circulating strains is dependent on the ability of laboratories in different countries to deposit full-length SARS-CoV-2 genome sequences and may be subject to ascertainment bias. As such, the frequencies of specific strains shown in Figure 1 may not reflect the size of the outbreak. However, the data do provide the opportunity to predict the presence of specific strains in areas given the known epidemiology within different countries and regions.

SARS-CoV-2 and Host Adaptation: Implications for global viral dynamics, pathogenesis and immunogenicity
Currently the possible functional effect(s) of the introduction of the AAACGA motif into the nucleocapsid are not known. The nucleocapsid protein is a key structural protein critical to viral transcription and assembly, suggesting that changes in this protein could either increase or decrease replicative fitness. It is also possible that these changes could be functionally compensated by linked polymorphisms in the virus and/or counterbalanced by some other host fitness benefit, although we have not found any other polymorphism linked to the K203/R204 change to date.
Selection of viral adaptations to polymorphic host responses mediated by T cells, NK cells, antibodies and antiviral drugs is well described for other RNA viruses such as HIV and HCV 4,11. HIV-1 adaptations to human leucocyte antigen (HLA)-restricted T-cell responses have also been shown to be transmitted and to accumulate over time 12,13. As previously shown for SARS-CoV, T-cell responses against SARS-CoV-2 are likely to target the nucleocapsid 14. Notably, the SARS-CoV-2 R203K/G204R polymorphisms modify the predicted binding of the HLA-C*07 allele to a putative T-cell epitope containing these residues. Escape from HLA-C-restricted T-cell responses may conceivably confer a fitness advantage for SARS-CoV-2, particularly in European populations where HLA-C*07 is prevalent and carried by >40% of the population (www.allelefrequencies.net).
The replication characteristics and plasticity of small, highly mutable viruses such as HIV and HCV are distinct from SARS-CoV-2, which is significantly less variable.