Codon Usage and Phenotypic Divergences of SARS-CoV-2 Genes

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which first emerged in Wuhan (China) in December of 2019, causes a severe acute respiratory illness with a high mortality rate, and has spread around the world. To gain an understanding of the evolution of the newly emerging SARS-CoV-2, we herein analyzed its codon usage pattern. For this purpose, we compared the codon usage of SARS-CoV-2 with that of other viruses belonging to the subfamily Orthocoronavirinae. We found that SARS-CoV-2 has a high AU content that strongly influences its codon usage, which appears to be better adapted to the human host. We also studied the evolutionary pressures that influence the codon usage of five conserved coronavirus genes encoding the viral replicase, spike, envelope, membrane and nucleocapsid proteins. We found different patterns of both mutational bias and natural selection that affect the codon usage of these genes. Moreover, we show here that the two integral membrane proteins (matrix and envelope) tend to evolve slowly by accumulating nucleotide mutations on their corresponding genes. Conversely, genes encoding the nucleocapsid (N), viral replicase and spike (S) proteins, although they are regarded as important targets for the development of vaccines and antiviral drugs, tend to evolve faster in comparison to the two genes mentioned above. Overall, our results suggest that the higher divergence observed for the latter three genes could represent a significant barrier in the development of antiviral therapeutics against SARS-CoV-2.


Introduction
The name "coronavirus" is derived from the Greek κορώνα, because of the typical crown-like shape of the virions. The first complete genome of a coronavirus (mouse hepatitis virus, MHV), a positive-sense, single-stranded RNA virus, was reported in 1990 [1]. Coronaviruses belong to the family Coronaviridae and range from 26.4 (ThCoV HKU12) to 31.7 (SW1) kb in genome length [2], having the largest genomes among all known RNA viruses, with G + C contents varying from 32% to 43% [3]. The Orthocoronavirinae

Nucleotide Composition Analysis
The nucleotide compositional properties were calculated for the coding sequences of the 30 CoV genomes. These properties comprise the frequencies of occurrence of each nucleotide (A, U, G and C); the AU and GC contents; and the G + C content at the first (GC1), second (GC2) and third (GC3) codon positions. We also calculated the mean G + C content at the first and second codon positions (GC12). All of these values were calculated with an in-house Python script.
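The quantities listed above can be illustrated with a minimal sketch. This is not the in-house script used in this work; the function name `nucleotide_composition` and the example sequence are hypothetical, and no codons are excluded from the position-specific counts.

```python
from collections import Counter


def nucleotide_composition(cds: str) -> dict:
    """Base frequencies, AU/GC content and GC1, GC2, GC3, GC12 of one coding sequence."""
    seq = cds.upper().replace("T", "U")
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("expected a non-empty sequence whose length is a multiple of 3")
    counts = Counter(seq)
    result = {base: counts[base] / len(seq) for base in "AUGC"}
    result["GC"] = result["G"] + result["C"]
    result["AU"] = result["A"] + result["U"]
    for pos in range(3):
        column = seq[pos::3]  # every base found at this codon position
        result[f"GC{pos + 1}"] = sum(base in "GC" for base in column) / len(column)
    result["GC12"] = (result["GC1"] + result["GC2"]) / 2
    return result


# Hypothetical toy sequence: AUG GCG UUA GGC
comp = nucleotide_composition("ATGGCGTTAGGC")
print(comp["GC1"], comp["GC2"], comp["GC3"], comp["GC12"])
```

In the toy sequence, three of the four third-position bases are G or C, so GC3 is 0.75, while GC1 and GC2 are both 0.5.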

RSCU
RSCU vectors for all the genomes were computed by using an in-house Python script, following the formula:
$$\mathrm{RSCU}_i = \frac{X_i}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_j}$$
In $\mathrm{RSCU}_i$, $X_i$ is the number of occurrences in a given genome of codon $i$, and the sum in the denominator runs over its $n_i$ synonymous codons. If the RSCU value for a codon $i$ is equal to 1, this codon is used exactly as often as expected under equal usage of its synonymous codons. Codons with RSCU values greater than 1 have positive codon usage bias, while those with a value less than 1 have relatively negative codon usage bias [19]. RSCU heat maps were drawn with the CIMminer software [20], which uses Euclidean distances and the average linkage algorithm.
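A minimal sketch of the RSCU calculation described above is given here for illustration. It is not the in-house script; the names `CODON_TABLE` and `rscu` are hypothetical, the standard genetic code is assumed, and codons of amino acids that are absent from the sequence are assigned NaN (an assumption of this sketch).

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds: str) -> dict:
    """RSCU of each of the 61 sense codons: X_i divided by the mean count of its synonyms."""
    seq = cds.upper().replace("T", "U")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(codons)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # stop codons are not part of the RSCU vector
            families[aa].append(codon)
    values = {}
    for synonyms in families.values():
        total = sum(counts[c] for c in synonyms)
        for c in synonyms:
            values[c] = counts[c] * len(synonyms) / total if total else float("nan")
    return values


# Hypothetical toy sequence: three GGU and one GGC (all glycine)
vector = rscu("GGTGGTGGTGGC")
print(vector["GGU"], vector["GGC"], vector["GGA"])
```

With four synonymous glycine codons and four glycine occurrences, the expected count per codon is 1, so GGU (used three times) has an RSCU of 3 and GGC an RSCU of 1.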

Effective Number of Codons Analysis
ENC is an estimate of the number of different codons effectively used in a coding sequence. In general, ENC ranges from 20 (when each amino acid is coded by a single codon) to 61 (when all synonymous codons are used on an equal footing). Given a sequence of interest, the computation of ENC starts from $F_\alpha$, a quantity defined for each family $\alpha$ of synonymous codons (one for each amino acid):
$$F_\alpha = \sum_{k=1}^{m_\alpha}\left(\frac{n_{k\alpha}}{n_\alpha}\right)^2$$
where $m_\alpha$ is the number of different codons in $\alpha$ (each one appearing $n_{1\alpha}, n_{2\alpha}, \ldots, n_{m_\alpha\alpha}$ times in the sequence) and $n_\alpha = \sum_{k=1}^{m_\alpha} n_{k\alpha}$. ENC then weights these quantities on a sequence:
$$\mathrm{ENC} = N_S + \frac{K_2\sum_{\alpha=1}^{K_2} n_\alpha}{\sum_{\alpha=1}^{K_2} n_\alpha F_\alpha} + \frac{K_3\sum_{\alpha=1}^{K_3} n_\alpha}{\sum_{\alpha=1}^{K_3} n_\alpha F_\alpha} + \frac{K_4\sum_{\alpha=1}^{K_4} n_\alpha}{\sum_{\alpha=1}^{K_4} n_\alpha F_\alpha}$$
where $N_S$ is the number of families with one codon only and $K_m$ is the number of families with degeneracy $m$ (the set of 6 synonymous codons for Leu can be split into one family with degeneracy 2, similar to that of Phe, and one family with degeneracy 4, similar to that, e.g., of Pro). ENC was evaluated by using the implementation in DAMBE 5.0 [21].
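The ENC computation can be sketched as follows. This is an illustrative sketch only, not the DAMBE implementation: it assumes the homozygosity form $F_\alpha = \sum_k (n_{k\alpha}/n_\alpha)^2$ and a usage-weighted average within each degeneracy class, splits six-fold families by their first two bases, and uses hypothetical names (`CODON_TABLE`, `effective_number_of_codons`).

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def effective_number_of_codons(cds: str) -> float:
    """ENC as N_S plus, for m = 2, 3, 4, K_m * sum(n_a) / sum(n_a * F_a)."""
    seq = cds.upper().replace("T", "U")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    # Keying on (amino acid, first two bases) splits Leu, Ser and Arg
    # into one 2-fold and one 4-fold family each.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[(aa, codon[:2])].append(codon)
    enc = float(sum(1 for syn in families.values() if len(syn) == 1))  # N_S
    for m in (2, 3, 4):
        group = [syn for syn in families.values() if len(syn) == m]
        numerator = denominator = 0.0
        for syn in group:
            n_alpha = sum(counts[c] for c in syn)
            if n_alpha == 0:
                continue  # family not observed in this sequence
            f_alpha = sum((counts[c] / n_alpha) ** 2 for c in syn)
            numerator += n_alpha
            denominator += n_alpha * f_alpha
        if denominator == 0:
            raise ValueError(f"no codons observed for degeneracy class {m}")
        enc += len(group) * numerator / denominator
    return enc


# Every sense codon used exactly once: no bias at all
uniform = "".join(codon for codon, aa in CODON_TABLE.items() if aa != "*")
print(effective_number_of_codons(uniform))
```

Under these assumptions a sequence that uses every sense codon equally reaches the upper limit of 61. Real sequences in which an entire degeneracy class is missing need a fallback that this sketch does not provide.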

Codon Adaptation Index
The codon adaptation index CAI [22] was used to quantify the codon usage similarities between the virus and host coding sequences. The principle behind CAI is that codon usage in highly expressed genes can reveal the optimal (i.e., most efficient for translation) codons for each amino acid.
Hence, CAI is calculated based on a reference set of highly expressed genes to assess, for each codon $i$, the relative synonymous codon usage ($\mathrm{RSCU}_i$) and the relative codon adaptiveness ($w_i$):
$$\mathrm{RSCU}_i = \frac{X_i}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_j}, \qquad w_i = \frac{\mathrm{RSCU}_i}{\max_{j=1,\ldots,n_i}\mathrm{RSCU}_j}$$
In $\mathrm{RSCU}_i$, $X_i$ is the number of occurrences of codon $i$ in the genome, and the sum in the denominator runs over the $n_i$ synonyms of $i$; RSCU thus measures codon usage bias within a family of synonymous codons. The quantity $w_i$ is then defined as the usage frequency of codon $i$ compared to that of the optimal codon for the same amino acid encoded by $i$ (i.e., the one which is most used in a reference set of highly expressed genes). The CAI for a given gene $g$ is calculated as the geometric mean of the usage frequencies of codons in that gene, normalized to the maximum CAI value possible for a gene with the same amino acid composition:
$$\mathrm{CAI}_g = \left(\prod_{i=1}^{l_g} w_i\right)^{1/l_g}$$
where the product runs over the $l_g$ codons belonging to that gene (except the stop codon). The values of this index range from 0 to 1, where a score of 1 represents the tendency of a gene to use the most frequently used synonymous codons in the host. The CAI analysis of these coding sequences was performed using DAMBE 5.0 [21]. The synonymous codon usage data of different hosts (human and other species) were retrieved from the codon usage database (http://www.kazusa.or.jp/codon/).
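The CAI definition above can be illustrated with a short sketch. It is not the DAMBE implementation: the names `relative_adaptiveness` and `cai` and the toy reference counts are hypothetical, and codons absent from the reference set are given a count of 0.5 so that their weight is not zero (an assumption of this sketch).

```python
import math
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def relative_adaptiveness(reference_counts: dict) -> dict:
    """w_i: usage of codon i relative to the most used synonymous codon in the reference set."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    weights = {}
    for synonyms in families.values():
        # Assumption: unseen codons get a count of 0.5 to avoid log(0) later.
        floored = {c: max(reference_counts.get(c, 0), 0.5) for c in synonyms}
        best = max(floored.values())
        for c in synonyms:
            weights[c] = floored[c] / best
    return weights


def cai(cds: str, weights: dict) -> float:
    """Geometric mean of w_i over the sense codons of a gene, computed in log space."""
    seq = cds.upper().replace("T", "U")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no sense codons found")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference usage for glycine codons only
weights = relative_adaptiveness({"GGU": 30, "GGC": 10, "GGA": 10, "GGG": 10})
print(cai("GGTGGT", weights), cai("GGTGGC", weights))
```

A gene made only of the optimal codon GGU scores 1, whereas replacing one of two codons with GGC (weight 1/3) lowers the geometric mean to the square root of 1/3.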
To study the patterns of codon biases in the coronaviruses, we used Z-score values:
$$Z_{\mathrm{ENC}} = \frac{\mathrm{ENC}_{\mathrm{CoV}} - \overline{\mathrm{ENC}}_v}{\sigma_v / \sqrt{N_v}}$$
where $\mathrm{ENC}_{\mathrm{CoV}}$ is the average ENC of a given coronavirus; $\overline{\mathrm{ENC}}_v$ and $\sigma_v$ are the average value of ENC and its standard deviation over all the viruses; and $N_v$ is the number of viruses (we use the standard deviation of the mean when comparing average values). The same Z-score was evaluated for the codon bias index CAI.
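As an illustration, a Z-score based on the standard deviation of the mean can be computed as in the sketch below. The function name `z_score` and the toy values are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption of this sketch.

```python
import math
import statistics


def z_score(value: float, all_values: list) -> float:
    """Deviation of one virus from the mean of all viruses, in units of the standard error."""
    mean = statistics.mean(all_values)
    # Standard deviation of the mean: sigma / sqrt(N)
    standard_error = statistics.stdev(all_values) / math.sqrt(len(all_values))
    return (value - mean) / standard_error


# Hypothetical per-virus ENC averages
enc_values = [50.0, 51.0, 49.0, 50.0, 52.0, 48.0]
print(z_score(52.0, enc_values))
```

Dividing by the standard error rather than by the standard deviation makes the score grow with the number of viruses, which is why modest absolute differences can still exceed a threshold such as |Z| > 3.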

The Similarity Index
The similarity index (SiD) provides a measure of similarity in codon usage between the virus (in our case, SARS-CoV-2) and the host under study. Formally, it is defined in terms of the quantity
$$R(A,B) = \frac{\sum_{i=1}^{59} a_i b_i}{\sqrt{\sum_{i=1}^{59} a_i^2 \, \sum_{i=1}^{59} b_i^2}}$$
where $a_i$ is the RSCU value of each of the 59 synonymous codons of the SARS-CoV-2 coding sequences and $b_i$ is the RSCU value of the identical codon of the potential host. $R(A,B)$ is the cosine of the angle between the vectors $A$ and $B$, and therefore quantifies the degree of similarity between the virus and the host in terms of their codon usage patterns. In our analysis, we considered the host species shown in Table 1 by Woo et al. [4]. We also considered snakes and pangolins, because they were previously identified as possible candidates for the novel coronavirus spillover into humans [9]. SiD values range from 0 to 1. Specifically, the higher the value of SiD, the more adapted the codon usage of SARS-CoV-2 to the host [23].
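The cosine term R(A,B) described above can be sketched as follows. The function name `cosine_similarity` and the toy vectors are hypothetical; the sketch covers only the cosine between two RSCU vectors, not any further transformation into the final SiD value.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two RSCU vectors listed in the same codon order."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / norm


# Hypothetical RSCU values for four synonymous codons in a virus and a host
virus = [3.0, 1.0, 0.0, 0.0]
host = [2.0, 1.0, 0.5, 0.5]
print(cosine_similarity(virus, host))
```

Because RSCU values are never negative, the cosine between two RSCU vectors always lies between 0 (no shared codon preference) and 1 (identical relative preferences).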

ENC Plot
ENC-plot analysis was performed to estimate the relative contributions of mutational bias and natural selection in shaping the CUB of genes encoding proteins that are crucial for SARS-CoV-2: RdRP, the spike-surface glycoprotein (protein S), the small envelope protein (protein E), the matrix protein (M) and the nucleocapsid protein (N). The ENC-plot is a plot in which ENC is the ordinate and the GC content in the third codon position (GC3) is the abscissa. Depending on the action of mutational bias and natural selection, different cases are discernible. If a gene is not subject to selection, a clear relationship is expected between ENC and GC3 [24]:
$$\mathrm{ENC} = 2 + s + \frac{29}{s^2 + (1-s)^2}$$
where $s$ represents the value of GC3 [24]. Genes whose codon preference is determined only by mutational bias are expected to lie on or just below Wright's theoretical curve. Alternatively, if a particular gene is subject to selection, then it falls well below Wright's theoretical curve. In this case, the vertical distance between the point and the theoretical curve provides an estimation of the relative extent to which natural selection and mutational bias affect CUB.
To evaluate the scattering of the points from Wright's theoretical curve, we calculated the absolute value of the distance, and the box plots were drawn with an in-house Python script.
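A minimal sketch of the distance from Wright's theoretical curve is shown below. It uses Wright's expected relationship ENC = 2 + s + 29/(s² + (1 − s)²), with s = GC3; the function names and the example point are hypothetical and this is not the in-house script.

```python
def wright_expected_enc(gc3: float) -> float:
    """Expected ENC when codon usage is shaped by GC3 composition alone (Wright's curve)."""
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)


def distance_from_curve(enc: float, gc3: float) -> float:
    """Absolute vertical distance between an observed (GC3, ENC) point and Wright's curve."""
    return abs(wright_expected_enc(gc3) - enc)


# Hypothetical gene with GC3 = 0.3 and ENC = 51.9
print(wright_expected_enc(0.3), distance_from_curve(51.9, 0.3))
```

The curve peaks at GC3 = 0.5, where the expected ENC is 60.5; a point at GC3 = 0.3 is compared with an expected value of 52.3.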

Neutrality Plot
We performed neutrality plot analysis [25] to estimate the relative contributions of natural selection and mutational bias in shaping the CUBs of five coronavirus genes that are central to the research aiming to develop a vaccine against SARS-CoV-2: M, N, S, RdRP and E. In this analysis, the GC1 or GC2 values (ordinate) were plotted against the GC3 values (abscissa), and each gene was represented as a single point on this plane. In this case, the three stop codons (UAA, UAG and UGA) and the three codons for isoleucine (AUU, AUC and AUA) were excluded from the calculation of GC3, and the single codons for methionine (AUG) and tryptophan (UGG) were excluded from all three (GC1, GC2 and GC3) [25].
For each gene, we separately performed a Spearman correlation analysis of GC1 and GC2 against GC3. If the correlation between GC12 and GC3 is statistically significant, the slope of the regression line provides a measure of the relative extent to which natural selection and mutational bias affect the CUBs of these genes (Sueoka 1999). In particular, if mutational bias is the driving force that shapes the CUB, then the corresponding data points should be distributed along the bisector (slope of unity). On the other hand, if natural selection also affects the codon choice of a family of genes, then the corresponding regression line should diverge from the bisector. Thus, the divergence between the regression line and the bisector quantifies the extent of codon usage preference due to natural selection.
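The regression step of the neutrality plot can be sketched with an ordinary least-squares fit. The function name `fit_line` and the toy GC values are hypothetical; the sketch covers only the slope and intercept, not the Spearman test.

```python
def fit_line(x: list, y: list) -> tuple:
    """Ordinary least-squares slope and intercept of y regressed on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least two points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical GC3 and GC12 values for one gene across several viruses
gc3 = [0.20, 0.30, 0.40]
gc12 = [0.35, 0.37, 0.39]
slope, intercept = fit_line(gc3, gc12)
print(slope, intercept, 1.0 - slope)  # last value: deviation of the slope from the bisector
```

In this toy case the slope is 0.2, far from the slope of unity expected along the bisector, which in the framework described above would point to a substantial contribution of natural selection.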

Forsdyke Plot
To study the mutational rates of genes M, N, S, RdRP and E, we performed an analysis by using our previously defined Forsdyke plot [26]. Each gene in SARS-CoV-2 (used as a reference) was compared to its orthologous gene in the 30 coronaviruses considered in this analysis. Each pair of orthologous genes is represented by a point in the Forsdyke plot, where protein divergence is correlated with DNA divergence (see Methods in [26] for details). The protein sequences were aligned using Biopython. The DNA sequences were then aligned using the protein alignments as templates.
Then, both DNA and protein divergences were assessed as explained in Methods in [26] by counting the number of mismatches in each pair of aligned sequences. Thus, each point in the Forsdyke plot measures the divergence between a pair of orthologous genes in the two species, as projected along the phenotypic (protein) and nucleotide (DNA) axes. The first step in each comparison is to compute the regression line of protein vs. DNA sequence divergence in the Forsdyke plot, obtaining values of intercept and slope for each gene (i.e., M, N, S, RdRP and E). To test whether the regression parameters associated with each gene are different or not, we followed a protocol established by Dilucca et al., considering a p-value ≤ 0.05.
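The mismatch counting that yields one point of the Forsdyke plot can be illustrated as follows. The function name `divergence`, the toy aligned sequences, the exclusion of gapped columns and the expression of divergence as a fraction are all assumptions of this sketch, not details taken from [26].

```python
def divergence(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Fraction of mismatching positions between two aligned sequences, ignoring gapped columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in columns) / len(columns)


# Hypothetical aligned orthologs: DNA (codon-aligned) and the corresponding proteins
dna_ref, dna_other = "AUGGCUUUAGGC", "AUGGCCUUGGGA"
prot_ref, prot_other = "MALG", "MALG"

# One point of the plot: (DNA divergence, protein divergence)
point = (divergence(dna_ref, dna_other), divergence(prot_ref, prot_other))
print(point)
```

In this toy pair all three nucleotide changes are synonymous, so the DNA divergence is 0.25 while the protein divergence is 0; a gene whose points behave like this across many orthologs would have a shallow slope.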

Phylogenetic Analysis
To explore the evolutionary relationships among the four genera of coronaviruses, phylogenetic analysis of the full-length genomic sequences of the 30 CoVs listed in Table 1 was performed. The sequences were aligned with ClustalO [27,28]. The resulting multiple sequence alignment was used to build a phylogenetic tree by employing a maximum likelihood (ML) method implemented in the software package MEGA version 10.1 [29]. ModelTest-NG [30] was used to select the best-fit evolutionary model of nucleotide substitution; that is, GTR + G + I. Bootstrap analysis (100 pseudoreplicates) was conducted in order to evaluate the statistical significance of the inferred trees.

Nucleotide Composition
We calculated the nucleotide compositions of the coronavirus genomes under study (see Table 1). Previous results showed that the gene N follows the trend A > U > G > C [12] and that the coronavirus RNA genomes are biased towards high AU content and low GC content [31]. In line with that, our results show that the nucleotide A is the most frequent base and that the nucleotide composition follows the trend A > U > G > C (see Table 2). Interestingly, SARS-CoV-2 has a nucleotide composition that is similar to that of the other CoVs but with a different trend, U > A > G > C. The GC content in SARS-CoV-2 is 0.37 ± 0.05.

All the Sequenced SARS-CoV-2 Genomes Share a Common Codon Usage
We downloaded the protein-coding sequences of SARS-CoV-2 from GISAID database, and classified each SARS-CoV-2 based on the geographic location in which it was sequenced (see tree in Figure A1). For each SARS-CoV-2 genome, we calculated the relative synonymous codon usage (RSCU), in the form of a 61-component vector. The heatmap and the associated clustering of these vectors are shown in Figure A2. We noted that the overall codon usage bias among SARS-CoV-2 strains appears to be similar. Moreover, their associated RSCU vectors did not cluster according to geographic location, thereby confirming the common origin of these genomes. Motivated by these observations, we considered a unique vector to represent the codon usage of SARS-CoV-2 in the following analyses.

Codon Usage of SARS-CoV-2
We compared the codon usage of SARS-CoV-2 with that of the other coronavirus genomes. For this purpose, we used the RSCU, which is a biologically relevant metric of the distance between the codon usage in the protein-coding sequences of these genomes. The heatmap of the RSCU values associated with the coronaviruses is shown in Figure 1. The RSCU values of the majority of the codons scored between 0 and 3.1 (see legend in Figure 1). Interestingly, the newly identified SARS-CoV-2 Wuhan-Hu-1 coronavirus clusters with the other two human coronaviruses SARSr-CoV and HCoV-229E. Moreover, in this heatmap, HCoV-HKU1 and HCoV-NL63 cluster together, consistent with viral adaptation to their host.
In line with previous observations, we show that the mean CpG relative abundance in the coronavirus genomes is markedly suppressed [32]. Specifically, GGG, GGC, CCG (pyrimidine-CpG) and ACG (purine-CpG) present low frequencies of occurrence, probably due to the relative tRNA abundance of the host. In SARS-CoV-2, the most frequently used codons are CGU (arginine, 2.34 times) and GGU (glycine, 2.42), whereas the least used codons are GGG (glycine) and UCG (serine). Of note, the most frequently used codons for each amino acid end with either U or A [18]. Heatmap was drawn with the CIMminer software [20], which uses Euclidean distances and the average linkage algorithm.

The Codon Usage of SARS-CoV-2 in Relation to the Human Host
To measure the codon usage bias in the coronavirus genomes, we used the effective number of codons (ENC) and the codon adaptation index (CAI). For each coronavirus, we calculated the average values of CAI and ENC associated with its genes. The ENC and CAI values for all the coronaviruses considered in this work are reported in Table 3. To visually enhance the differences among the codon usage of these coronaviruses, we calculated the Z-score value of each virus with respect to the average values of ENC and CAI calculated for all 30 coronaviruses.
The human coronaviruses show different patterns of codon usage (Figure 2). With the exception of HCoV-OC43, all the human coronaviruses have ENC and CAI values that are significantly different from the average values of ENC and CAI calculated for all 30 coronaviruses (|Z-score| > 3). Specifically, the ENC value associated with SARS-CoV-2 (51.9 ± 2.59) is significantly higher than the average of all coronaviruses (50.09 ± 1.32), indicating that SARS-CoV-2 uses a broader set of synonymous codons in its coding sequences. Moreover, the CAI of SARS-CoV-2 (0.727 ± 0.054) is markedly higher than the average one (0.69 ± 0.024), underscoring that SARS-CoV-2 uses codons that are better adapted to its host. The CAI of SARS-CoV-2 is also significantly higher than the CAI of the other human CoVs in the subfamily, thereby suggesting a greater adaptation to the human host for SARS-CoV-2 compared to the other coronaviruses. Finally, the ENC values of the three most pathogenic HCoVs having Z-scores > 3 (SARS-CoV, SARS-CoV-2 and MERS) are, on average, higher than the ENCs of the other four HCoVs, which instead have Z-scores < −3. The stronger CUB (lower ENC) of these four HCoVs is consistent with their strong adaptation to humans, as they have been circulating in the population for a long time and are now less pathogenic.
To better clarify the origin of SARS-CoV-2 and its optimization to the human host, we then calculated the average CAI for the SARS-CoV-2 genes by using different reference hosts (Figure 3).
Interestingly, snake and human hosts correspond to the highest values of CAI, indicating that SARS-CoV-2 uses codons that are better optimized to these two organisms. Although our results suggest a possible origin of SARS-CoV-2 from snakes and its spillover into humans [33], previous studies do not support this hypothesis [34,35].

Table 3 by Woo et al. [4]. Regarding SARS-CoV-2, we considered a human host. In red, we show the human coronaviruses. Several coronaviruses have codon usage preference values higher than the average value of the family (|Z-score| > 3). The statistically significant differences are marked with asterisks. In particular, SARS-CoV-2 genes have average values of CAI and ENC that are higher than the average of all coronaviruses. (*): |Z-score| > 3.

To corroborate this observation, we also calculated the similarity index (SiD) of SARS-CoV-2 for the hosts reported in Figure 3 (see Figure A4). SiD values range from 0 to 1; the higher the value of SiD, the more adapted the codon usage of SARS-CoV-2 to the host [23]. Since recent studies have revealed multiple lineages of Malayan pangolin (Manis javanica) coronavirus that are similar to SARS-CoV-2 [36], we also added this organism to the present analysis. CAI was not calculated for pangolin because its genome is not well-annotated, and the five genes under investigation (M, N, S, E and RdRP) are not available. The SiD values obtained range from 0.23 (in rabbit) to 0.78 (in human). Notably, this analysis not only confirms our previous observation (see Figure 3) that SARS-CoV-2 uses codons that are better optimized to snakes (SiD = 0.75) and humans (SiD = 0.78), but reveals the same for pangolins (SiD = 0.76), bats (SiD = 0.70), and rats (SiD = 0.71), which are also possible hosts for SARS-CoV-2 [9].

Selective Pressures and Mutational Rates Characterizing Five Conserved Coronavirus Genes
The genome of the newly emerging SARS-CoV-2 consists of a single, positive-stranded RNA, which is approximately 30,000 nucleotides long. The newly sequenced SARS-CoV-2 genome is organized similarly to the other coronavirus genomes. Ceraolo et al. performed a cross-species analysis for all proteins encoded by SARS-CoV-2 (see Figures 3 and 4 in [37]). The genome encodes polyproteins common to all betacoronaviruses, which are further cleaved into the individual structural proteins E, M, N and S, and the non-structural RdRP [38]. Thus, only five viral genes, classified according to their viral locations, were studied for each virus, because the short length and insufficient codon usage diversity of the other genes might have biased our results.
The corresponding gene products are involved in essential viral functions. Briefly, S protein regulates viral attachment to the receptor of the target host cell [39]; E protein functions to assemble the virions and acts as an ion channel [40]; M protein plays a role in viral assembly and is involved in the biosynthesis of new virus particles [41]; N protein forms the ribonucleoprotein complex with the viral RNA [12]; RdRP catalyzes viral RNA synthesis. For these five proteins, the RSCU vectors in each virus of the dataset are shown in Figures 4 and A5. SARS-CoV-2 clusters with SARSr-CoV and SARSr-Rh-BatCoV HKU3 only for genes E, M and N, consistent with the inferred phylogeny shown in Figure A3.

The ENC Plot Analysis of Individual Genes of SARS-CoV-2
To further investigate which factors account for the low codon usage bias of the coronavirus genes, we analyzed the relationship between the ENC value and the percentage of G or C in the third codon position (GC3s). The ENC-plots obtained for the five genes (M, N, S, E and RdRP) are shown separately together with Wright's theoretical curve (Figure 5), which represents the case in which codon usage is determined exclusively by GC3s [24]. Thus, if mutational bias, as quantified by GC content in the generally neutral third codon position, is the main factor in determining the codon usage among these genes, the corresponding point in the ENC-plot should lie on or just below Wright's curve. In Figure 5, all distributions lie below the theoretical curve, an indication that not only mutational bias but also natural selection plays a non-negligible role in the codon choices in all genes. This is also exemplified by the violin plots in Figure 6 showing the distances between the genes and Wright's theoretical curve in the ENC-plot.
Genes N, S and RdRP are more scattered below the theoretical curve than genes M and E, implying that in the latter the codon usage patterns are largely consistent with the effects of mutational bias. Interestingly, data points corresponding to the gene N, which is the major viral structural component needed to protect and encapsidate the viral RNA, are clustered more closely around GC3 = 0.5 (see Figure 5). This means that the displacement under Wright's theoretical curve most likely reflects the selective pressure exerted on this gene. Conversely, all other genes show a displacement towards lower values of GC3 content, thereby corroborating our previously mentioned observation that coronaviruses tend to use codons that end with A and U (see Section 3.3).

Neutrality Plot of Individual Genes of SARS-CoV-2
A neutrality plot analysis was performed to estimate the role of mutational bias and natural selection in shaping the codon usage patterns of the five genes under investigation. In this plot, the average GC-content in the first and second positions of codons (GC12) is plotted against GC3s, which is considered as a pure mutational parameter. In Figure 7, the neutrality plots obtained for genes M, N, S, E and RdRP, together with the best-fit lines and the slopes associated with them are shown.
The rationale behind these results is as follows: the wider the deviation between the slope of the regression line and the bisector, the stronger the action of selective pressure. All correlations are highly significant (Spearman correlation, R² analysis, p-value < 0.0001). By comparing the divergences between the regression lines and the bisectors in each panel, we reveal that the five genes considered herein depend on a balance between natural selection and mutational bias. Specifically, in line with the ENC-plot analyses, the genes S and RdRP present the largest deviations of their regression lines from the bisector lines, thereby indicating a stronger action of natural selection. Conversely, the regression line for the gene M is closer to the bisector than those of the other genes, meaning that this gene is the least subject to the action of natural selection. Finally, the genes E and N are intermediate between the previous cases.
Notably, almost all data points are clustered below the bisector lines, implying a selective tendency for a higher AU content in the first two codon positions than in the third one. Additionally, both GC3 and GC12 are lower than 0.5, reflecting a general preference for A and U bases in all three codon positions. Interestingly, data points associated with genes M and E are closer to the bisector lines compared to genes N, S, and RdRP. Based on this observation, we suggest that the GC content in the first two codon positions tends to be in proportion to GC3 in genes M and E, and this partially explains the closeness of these two genes to Wright's theoretical curve in Figure 5.

Forsdyke Plot of Individual Genes of SARS-CoV-2
We analyzed the DNA divergence and protein sequence divergence that characterize these five genes by comparing the nucleotide sequences of the newly emerging SARS-CoV-2 and their corresponding protein sequences with those of the other coronaviruses under study. Each SARS-CoV-2 gene was compared to its orthologous gene in the 30 coronaviruses to estimate evolutionary divergences. Each pair of orthologous genes is represented by a point in the Forsdyke plot [26], where protein divergence is correlated with DNA divergence. Each point in the Forsdyke plots measures the divergence between a pair of orthologous genes in the two species, as projected along the phenotypic (protein) and nucleotide (DNA) axes. Thus, the slope is an estimation of the fraction of DNA mutations that result in amino acid substitutions [26]. In Figure 8, a separate Forsdyke plot is shown for each gene.
Overall, protein and DNA sequence divergences are linearly correlated, and these correlations correspond to slopes and intercepts of the regression lines.
Genes M and E display quite low slopes, indicating that these proteins tend to evolve slowly by accumulating nucleotide mutations on their corresponding genes. Conversely, the steeper slopes for genes N, RdRP and S suggest that these genes tend to evolve faster compared to the other ones. A plausible explanation for this observation is that protein N, due to its immunogenicity, has been frequently used to generate specific antibodies against various animal coronaviruses, including SARS [42]. The viral replicase polyprotein is essential for the replication of viral RNA, and finally, gene S encodes the protein that is responsible for the "spikes" present on the surface of coronaviruses. Our results suggest that the higher divergence observed in these three proteins could represent a major obstacle to the development of a therapeutic treatment against SARS-CoV-2.

Discussion
To investigate the factors determining the codon usage patterns of SARS-CoV-2 and other coronaviruses, several analytical methods were used in our study. First, the RSCU value of the SARS-CoV-2 was calculated. Despite the relatively high mutation rate that characterizes SARS-CoV-2, as other RNA viruses, we could not find any significant differences in codon usage between its genome and the ones of the other CoVs. Moreover, their associated vectors did not cluster based on geographical position, further confirming the common origin of these genomes.
In line with the common nucleotide composition of other RNA viruses such as SARS, our results show that SARS-CoV-2 has a high AU content and a low GC content. The results also indicate that codon usage bias exists and that SARS-CoV-2 prefers U-ending codons. The codon usage bias was further confirmed by a mean ENC value of 51.9 (a value greater than 45 is considered a slight codon usage bias due to mutation pressure or nucleotide compositional constraints). These findings were also corroborated by the CAI analysis, which measures the deviation of a given protein coding gene sequence with respect to a reference set of the most highly expressed genes in the host. This suggests that those RNA viruses with high ENC values (and low CAI) adapt to the host with randomly chosen codons. Therefore, a slightly biased codon usage pattern might allow the virus to use several codons for a respective amino acid, and it might be beneficial for viral replication and translation in host cells.
We then analyzed in more detail the relationships between SARS-CoV-2 and various possible hosts other than humans. For this purpose, we calculated the average CAI and SiD values of individual SARS-CoV-2 genes against different candidate hosts. Although previous studies do not support transmission of SARS-CoV-2 from snakes to humans [34,35], we showed that SARS-CoV-2 has the highest CAI values by considering these two organisms as references, and therefore, it should use codons that are better optimized to snakes and humans. Moreover, we demonstrated that the adaptiveness of SARS-CoV-2's codon usage, as measured by SiD, is also fairly high for pangolins, rats, and bats, thereby confirming previous hypotheses regarding the possible origin of SARS-CoV-2 from these species [9].
The ENC-plot analysis indicated that natural selection plays an important role in the codon choice of the five conserved viral genes under study; namely, RdRP, S, E, M and N. However, genes N, S and RdRP are more scattered below the theoretical curve compared to genes M and E, implying that in the latter the codon usage is more a sign of mutational bias than of natural selection. According to the neutrality plot analysis, the genes S and RdRP are subject to a stronger action of natural selection; the gene M, whose regression line is closer to the bisector than those of the other genes, is the least subject to natural selection; and the genes E and N are in an intermediate situation.
Forsdyke plots were employed to analyze the mutation statuses of these five genes. Proteins M and E were found to have gentler slopes, thereby reflecting a tendency to evolve slowly by accumulating nucleotide mutations on their respective genes. Conversely, the steeper slopes for the three genes N, RdRP and S (encoding a protein responsible for the "spikes" present on the surface of coronaviruses), indicate that these three genes, and therefore their corresponding protein products, evolve faster compared to the other two genes.
Interestingly, all x-intercepts (see Table 4) are negative and the degree of negativity correlates with the low slope values. Recalling that the x-axis (RNA change) can be viewed as a time axis, it appears that the RNA segments encoding M and E are as resistant to change during the early period of genome divergence (negative x values) as they are during the later period of divergence when phenotypic changes can be naturally selected (positive x values). M and E are less flexible at the protein level. On the other hand, the RNA segments encoding S, RdRP and N are flexible during the early genome divergence period (high negative x values). As a result, these segments would have been more able to contribute to the initial genotypic divergence that would have decreased recombination between two genomes diverging in a common cell, thereby facilitating speciation. Under the protection of this global "reproductive isolation", the segments could then evolve during the period corresponding to positive x values. Without reproductive isolation, blending would have occurred and phenotypic divergence would be less possible.
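For a regression line of protein divergence against nucleotide divergence, the x-intercept discussed above follows directly from the slope and the y-intercept. The sketch below uses the hypothetical function name `x_intercept` and hypothetical parameter values, not the values of Table 4.

```python
def x_intercept(slope: float, intercept: float) -> float:
    """Value of x at which the line y = slope * x + intercept crosses y = 0."""
    if slope == 0:
        raise ValueError("a horizontal line has no x-intercept")
    return -intercept / slope


# Hypothetical regression parameters for two genes
print(x_intercept(0.5, 0.1))  # steeper line
print(x_intercept(0.2, 0.1))  # shallower line, more negative x-intercept
```

For a fixed positive y-intercept, a shallower slope gives a more negative x-intercept, which is the kind of relationship between slope and x-intercept referred to in the paragraph above.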
In future studies, it would be interesting to explore why M and E are less flexible and S, RdRP and N are more flexible towards preventing recombination. Viral RNA recombination requires recognition between two comparable RNA regions and then extensive base pairing, mediated by the kissing stem-loop interaction, to thoroughly examine sequence complementarity. Perhaps the M and E genes lack the ability to form stem-loops, but this inflexibility during phenotypic divergence is suggestive of high conservation.
The findings of the present study could be useful for developing diagnostic reagents and probes for detecting a wide range of viruses and isolates in one test and for vaccine development, utilizing the information about codon usage patterns in these genes.
In addition, an interesting potential idea for the treatment of pneumonia related to SARS-CoV-2 and other similar viruses is a low dose of ionizing radiation (LDIR). SARS-CoV-2 is an RNA virus with an expected mutation rate similar to that of other RNA viruses, as discussed above. This mutation rate is usually much higher than the corresponding one of any human host. Therefore, as discussed in a recent paper [43], any antiviral drug against SARS-CoV-2 would exert an intense selective pressure on the virus. This may result in highly adaptive and treatment-resistant virus types with enhanced pathogenicity. It should also be taken into consideration that the virus will create a systemic inflammatory response with detrimental effects in the host organism, i.e., acute respiratory distress syndrome (ARDS), a form of severe hypoxemic respiratory failure associated with major inflammatory injury to the lung cells and extravasation of protein-rich edema fluid into the airspace [44,45]. Low dose radiation (<0.5 Gy) has indeed been shown, in some cases, to have anti-inflammatory effects and to modulate the immune response, and has even been suggested for treating pneumonia [46]. This LDIR exposure is not expected to exert significant selective pressure on the new coronavirus. Therefore, and based also on recent suggestions, one can hypothesize that a low dose treatment of 30 to 100 cGy to the lungs of a patient with COVID-19 pneumonia could ameliorate the inflammation significantly and relieve the life-threatening systemic symptoms of the infection [47].

Table 4. Parameters of the linear regressions in Forsdyke plots. None of these plots intersect. The value of each parameter increases from M (lowest) to N (highest). Negative x values indicate flexibility to respond to global pressure to change DNA sequence (virtual time axis) in order to prevent recombination, and to thus allow species divergence (i.e., generate SARS-CoV-2). When recombination is still possible, two diverging genomes in the same cell will blend, thus militating against the protein differentiation that would otherwise occur in the time corresponding to positive x values.

On the horizontal axis, the 13 eukaryotic species that were considered in the comparisons are shown. The host species are ranked in ascending order. CAI values for the bat, rat, hamster, snake, pangolin, and human are higher compared to the other species.