Stepwise Evolution and Exceptional Conservation of ORF1a/b Overlap in Coronaviruses

Abstract The programmed frameshift element (PFE) rerouting translation from ORF1a to ORF1b is essential for the propagation of coronaviruses. The combination of genomic features that make up PFE—the overlap between the two reading frames, a slippery sequence, as well as an ensemble of complex secondary structure elements—places severe constraints on this region as most possible nucleotide substitution may disrupt one or more of these elements. The vast amount of SARS-CoV-2 sequencing data generated within the past year provides an opportunity to assess the evolutionary dynamics of PFE in great detail. Here, we performed a comparative analysis of all available coronaviral genomic data available to date. We show that the overlap between ORF1a and ORF1b evolved as a set of discrete 7, 16, 22, 25, and 31 nucleotide stretches with a well-defined phylogenetic specificity. We further examined sequencing data from over 1,500,000 complete genomes and 55,000 raw read data sets to demonstrate exceptional conservation and detect signatures of selection within the PFE region.

Coronaviruses have large 26-32 kbp positive-strand RNA genomes. The initial 2 = 3 of the genome is occupied by an open reading frame (ORF) ORF1ab encoding nonstructural proteins essential for the coronaviral life cycle. As the designation "ab" suggests, it contains two reading frames with the 3 0 -end of ORF1a overlapping with the 5 0 -terminus of ORF1b. ORF1b is in À1 phase relative to ORF1a and translated via the À1 programmed ribosomal frameshifting controlled by the PFE. As ORF1b encodes crucial components of coronavirus transcription/replication machinery, including the RNAdependent RNA polymerase (RdRp), disrupting PFE abolishes viral replication completely (Brierley 1995;Plant et al. 2010;Sola et al. 2015;Kelly et al. 2020). PFE consists of three consecutive elements: 1) an attenuator loop, 2) the "NNN WWW H" slippery heptamer, and 3) a pseudoknot structure (Kelly et al. 2020;Huston et al. 2021). The sequence and structural conformation of these elements determine the efficiency of the frameshift event, which ranges from 15% to 30% in SARS-CoV and SARS-CoV-2 (Baranov et al. 2005;Kelly et al. 2020). Because disruption of PFE arrests viral replication, it is a promising therapeutic target. As a result, a number of recent studies have scrutinized its characteristics (reviewed in Rangan et al. (2021)) revealing a fluid secondary structure (Iserman et al. 2020;Ziv et al. 2020;Huston et al. 2021). In addition to secondary structures, PFE harbors the overlap between ORF1a and ORF1b. It is defined as the stretch of sequence from "H" in the slippery heptamer to the stop codon of ORF1a. The position of the ORF1a stop codon determines overlap length. For example, in SARS-CoV-2, it is 16 bp, while in mouse hepatitis virus (MHV) it is 22 nt (Plant et al. 2010).
Our group has been interested in the evolutionary dynamics of overlapping coding regions (Nekrutenko et al. 2005;Chung et al. 2007;Szklarczyk et al. 2007). The vast amount of newly generated sequence and functional data-a result of the current SARS-CoV-2/COVID-19 pandemic-provides an opportunity to re-examine our current knowledge. The length of the ORF1a and ORF1b overlap is phylogenetically conserved. It evolved in a stepwise manner, where the changes in the overlap length are results of the loss of ORF1a stop codons leading to ORF1a extension, and the acquisition of insertions and deletions causing early stops of ORF1a.
Distance-based methods had shown that the d-coronavirus genus was an early split-off lineage compared to a-, b-, and c-coronavirus ( fig. 1). Comparisons of the RdRp, 3CL pro , HEL, M, and N proteins suggested that cwas more closely related to d-coronavirus, while aand b-coronavirus cluster together forming a distant clade (de Groot et al. 2012;Woo et al. 2012; Coronaviridae Study Group of the International Committee on Taxonomy of Viruses 2020). However, comparing the S protein trees, aand d-coronavirus

Brief Communications
ß The Author(s) 2021. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons. org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Open Access share a higher amino acid identity, while band c-coronavirus cluster together ). Due to this, we initially assumed that a, b, and c formed an unresolved trifurcation ( fig. 1). To assess all possible configurations within this region, we surveyed all genomic sequences of family Coronaviridae available from the National Center for Biotechnology Information (NCBI; see Materials and Methods section). The distribution of overlap lengths among 4,904 coronaviral genomes (supplementary table S1, Supplementary Material online) is shown in supplementary fig. 1, Supplementary Material online. There are five distinct overlap length groups (7, 16, 22, 25, and 31 nt) with clear taxonomic specificity.
We then compared the first 15 amino acids of ORF1b in all 4,904 entries ( fig. 2). The amino acid sequences are highly conserved: positions 1 (R), 2 (V), 4 (G), 7 (S), 11-13 (ARL), and 15 (P) are almost invariable and highly redundant. Next, we compared the underlying nucleotide sequences of the PFE region ( fig. 3). This suggests the following potential series of evolutionary events. d-Coronavirus with 7 nt overlap most likely represents the ancestral state. Comparing coronaviruses with 7 nt (d-coronavirus) and 31 nt (a-and c-coronavirus) in  (fig. 3D). The variable position of the stop codon likely has an implication to the frameshift efficiency in these taxa as was shown by Bhatt et al. (2021). These authors demonstrated that extension of the distance between the slippery heptamer and the stop codon of 0-frame decreases frameshifting frequency: an increase in the distance by 15 nucleotides, as is the case in aand c-coronaviruses ( fig.  3), decreases efficiency by $20%, while removal of the stop decreases it by half.

MBE
The abundance of SARS-CoV-2 sequencing data allows examining the substitution dynamics in population-and individual-level sequencing data. For population-level analysis, we identified variants in the PFE region from >1,550,000 genome sequences available from GISAID (see Materials and Methods section). However, because GISAID contains only assembled genomes, these data do not provide information about individual-level (intrasample) variation. Hence, we performed an additional detailed analysis of >55,000 samples generated with the COG-UK (Lythgoe et al. 2021) consortium (see Maier et al. (2021) for analysis details). A summary of results from both analyses is shown in table 1. There is little variation in the PFE region as the fraction of samples containing individual substitutions appears to be small (the two "Count" columns in table 1). Furthermore, the 30 out of 36 substitutions in table 1 are consistent with being a result of RNA editing events from APOBEC (Chen and MacCarthy 2017) or ADAR (Bazak et al. 2014) enzymatic complexes. The remaining six substitutions (all transitions) are predominantly located in the loop regions of the predicted PFE secondary structure (Huston et al. 2021) and thus likely have minimal effect on the secondary structure.
Through a comparative analysis of GISAID sequences, we found that several codons with non-negligible levels of variation (table 2) were subject to purifying selection: RdRp: 1 (A13,443>C/G), RdRp: 31 (13,532 A>G/C), RdRp: 32 (13,535 C>T). This is consistent with a strong degree of functional constraint. Interestingly, this analysis also identified a single codon: RdRp: T26I (13,516 T>C), which has been subject to pervasive positive selection since early 2021. Most of the sequences with this substitution are in the B.1.1.7 and B.1.177.77 lineages (this is a consensus majority mutation in B.1.177.77 and B.1.614 lineages). RdRp: T26I is present at low frequencies in many viral lineages but is increasing in prevalence in recent months (0.5-1.0% global prevalence in recent samples). Functional significance, if any, for this substitution has not been reported.
Our results provide an alternative way to assess exceptional conservation of the PFE using publicly available sequence data highlighting the fact that the entire PFE region appears to be under strong purifying selection. These patterns are similar to observations obtained from deep mutational scanning where any alteration at the majority of PFE region sites have deleterious effects on the frameshift efficiency (e.g., Carmody et al. 2021).

Coronavirus Entries Retrieval and Filter
The 35,152 coronaviral entries in the NCBI taxonomy database were sorted by length, and only those longer than 14,945 nt were kept, leaving a total of 4,939 genomes. The slippery site and following overlap sequences were manually inspected, in case the slippery site was incorrectly annotated. We further filtered out those entries if they contained no annotation information, or had gapped sequences in the overlap. 4,904 coronavirus entries were selected using this approach (supplementary table S1, Supplementary Material online).

Amino Acid Alignment and Nucleotide Alignment of the Overlap Region
For all d-coronavirus entries in supplementary table S1, Supplementary Material online, the first 13 amino acids of ORF1b were taken to generate a consensus sequence using WebLogo (Crooks et al. 2004). The same was done to a-coronavirus and c-coronavirus. Within b-coronavirus, for Nobecovirus, Embecovirus, and Merbecovirus, the first 14  C  T  1,812  ----13,429 a  C  T  460  ----13,430 a  C  T  169  ----13,431 a  C  T  517  ----  MBE amino acids were used to build the consensus; for Hibecovirus and Sarbecovirus, the first 13 amino acids were used. In terms of the nucleotide sequence alignments, for each genus/subgenus, the nucleotide sequences used to generate the amino acids mentioned above were taken to make the nucleotide consensus sequence using WebLogo.

Processing of GISAID Data
Each genome was subjected to codon-aware alignment with the NCBI reference genome (accession number NC_045512) and then subdivided into ten regions based on CDS features: ORF1a (including nsp10), ORF1b (starting with nsp12), S, ORF3a, E, M, ORF6, ORF7a, ORF8, N, and ORF10. For each region, we scanned and discarded sequences containing too many ambiguous nucleotides to remove data with possible sequencing errors. Thresholds were 0.5% for the S gene, 0.1% for ORF1a and ORF1b genes, and 1% for all other genes. We mapped individual sequences to the NCBI reference genome (NC_045512) using a codon-aware extension to the Smith-Waterman algorithm implemented in the BioExt package Gianella et al. 2011) and translated mapped sequences to amino-acids. Codon sequences were next mapped onto the amino-acid alignment. Variants were called directly. Selection analyses were performed using the protocols used previously (Faria et al. 2021;Tegally et al. 2021

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.