Identifying and prioritizing potential human-infecting viruses from their genome sequences

Determining which animal viruses may be capable of infecting humans is currently intractable at the time of their discovery, precluding prioritization of high-risk viruses for early investigation and outbreak preparedness. Given the increasing use of genomics in virus discovery and the otherwise sparse knowledge of the biology of newly discovered viruses, we developed machine learning models that identify candidate zoonoses solely using signatures of host range encoded in viral genomes. Within a dataset of 861 viral species with known zoonotic status, our approach outperformed models based on the phylogenetic relatedness of viruses to known human-infecting viruses (area under the receiver operating characteristic curve [AUC] = 0.773), distinguishing high-risk viruses within families that contain a minority of human-infecting species and identifying putatively undetected or so far unrealized zoonoses. Analyses of the underpinnings of model predictions suggested the existence of generalizable features of viral genomes that are independent of virus taxonomic relationships and that may preadapt viruses to infect humans. Our model reduced a second set of 645 animal-associated viruses that were excluded from training to 272 high and 41 very high-risk candidate zoonoses and showed significantly elevated predicted zoonotic risk in viruses from nonhuman primates, but not other mammalian or avian host groups. A second application showed that our models could have identified Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) as a relatively high-risk coronavirus strain and that this prediction required no prior knowledge of zoonotic Severe Acute Respiratory Syndrome (SARS)-related coronaviruses. Genome-based zoonotic risk assessment provides a rapid, low-cost approach to enable evidence-driven virus surveillance and increases the feasibility of downstream biological and ecological characterization of viruses.

Rapid assessment of which animal viruses may be capable of infecting humans is currently intractable, but would allow their prioritization for further investigation and pandemic preparedness. We developed machine learning algorithms that identify candidate zoonoses using evolutionary signals of host range encoded in viral genomes. This reduces lists of hundreds of viruses with uncertain human infectivity to tractable numbers for prioritized research, generalizes to virus families excluded from model training, can distinguish high-risk viruses within families that contain a minority of zoonotic species, and could have identified the exceptional risk of SARS-CoV-2 prior to its emergence. Genome-based risk assessment allows identification of high-risk viruses immediately upon discovery, increasing both the feasibility and likelihood of downstream virological and ecological characterization and allowing for evidence-driven virus surveillance.

Introduction:
Most emerging infectious diseases of humans are caused by viruses that originate from other animal species. Identifying these zoonotic threats prior to emergence is a major challenge since only a small minority of the estimated 1.67 million animal viruses may infect humans (1,2). Existing models of human infection risk rely on viral phenotypic information that is unknown for newly discovered viruses (e.g., diversity of species a virus can infect) or that varies insufficiently to discriminate risk at the species or strain level (e.g., replication in the cytoplasm), limiting their predictive value (3, 4). Since most viruses are now discovered using untargeted sequencing, often involving many simultaneous discoveries with limited phenotypic data, an ideal approach would quantify the relative risk of human infectivity directly from sequence data alone.
By identifying high risk viruses warranting further investigation, such predictions could alleviate the growing imbalance between the rapid pace of virus discovery and lower throughput field and laboratory research needed to comprehensively evaluate risk.

Current models can identify well-characterized human-infecting viruses from genomic sequences (5,6). However, by training algorithms on very closely related viruses (i.e., strains of the same species) and potentially omitting secondary characteristics of viral genomes linked to infection capability, such models are less likely to find signals of zoonotic status that generalize across viruses. Consequently, predictions may be highly sensitive to substantial biases in current knowledge of viral diversity (2, 7). Overcoming these biases requires discovering and exploiting signals of human infectivity that generalize across unrelated viruses. Empirical and theoretical lines of evidence suggest such signals might exist (8,9). For example, the depletion of CpG dinucleotides in vertebrate-infecting RNA virus genomes may have arisen to evade zinc-finger antiviral protein (ZAP), an interferon-stimulated gene (ISG) that initiates the degradation of CpG-rich RNA molecules (10). While ZAP occurs widely among vertebrates, increasingly recognized lineage-specificity in vertebrate antiviral defences opens the possibility that analogous, undescribed nucleic-acid targeting defences might be human (or primate) specific (11). Independently, the frequencies of specific codons in virus genomes can resemble those of their reservoir hosts, possibly owing to increased efficiency and/or accuracy of mRNA translation (12). As such, genome compositional similarity to human-adapted viruses or to the human genome may preadapt viruses for human infection (9, 13).
We aimed to develop machine learning algorithms which use features engineered from viral and human genome sequences to predict the probability that any animal-infecting virus will infect humans given biologically relevant exposure.

Results and discussion:
We collected a single representative genome sequence from 861 RNA and DNA virus species spanning 36 viral families that contain animal-infecting species (fig. S1). We labelled each virus as being capable of infecting humans or not using published species-specific reports as ground truth, and trained models to classify viruses accordingly. Importantly, given diagnostic limitations and the likelihood that not all viruses capable of human infection have had opportunities to emerge, viruses not reported to infect humans may represent unrealized or undocumented zoonoses or genuinely non-zoonotic species. Identifying these potential zoonoses was an a priori goal of our analysis.

We first evaluated whether evolutionary proximity to human-infecting viruses predictably elevates zoonotic risk. Gradient boosted machine (GBM) classifiers trained on virus taxonomy or the frequency of human-infecting viruses among close relatives ("phylogenetic neighbourhood" (14)) outperformed chance (median area under the receiver-operating characteristic curve [AUC_m] = 0.604 and 0.558, respectively), but were no better than simply ranking novel viruses by the proportion of human-infecting viruses in each family ("taxonomy-based heuristic", AUC_m = 0.596, fig. 1A), indicating the inability of these relatedness-based models to distinguish risk at scales below the viral family level.
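The taxonomy-based heuristic described above can be illustrated with a minimal sketch. The function names and the toy data below are hypothetical and are not taken from the study; the sketch assumes each training virus is given as a (family, infects humans) pair and scores each novel virus by the proportion of human-infecting species in its family.

```python
from collections import defaultdict


def family_proportions(training):
    """Proportion of human-infecting species per family in the training set."""
    counts = defaultdict(lambda: [0, 0])  # family -> [human-infecting, total]
    for family, infects_humans in training:
        counts[family][0] += int(infects_humans)
        counts[family][1] += 1
    return {family: pos / total for family, (pos, total) in counts.items()}


def rank_by_family(novel, proportions, default=0.0):
    """Rank novel viruses by the proportion for their family, highest first.

    Families absent from the training set get `default` (an assumption
    made for this sketch only).
    """
    scored = [(name, proportions.get(family, default)) for name, family in novel]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy, made-up data: (family, infects humans)
training = [
    ("FamA", True), ("FamA", True), ("FamA", False),
    ("FamB", False), ("FamB", True), ("FamB", False), ("FamB", False),
]
novel = [("virus_1", "FamB"), ("virus_2", "FamA"), ("virus_3", "FamA")]

proportions = family_proportions(training)
ranking = rank_by_family(novel, proportions)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy example virus_2 and virus_3 receive identical scores because they share a family, which is the limitation noted above: a family-level score cannot separate viruses within the same family.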
We next quantified the performance of GBMs trained on genome composition (i.e., codon usage biases, amino acid biases and dinucleotide biases), calculated either directly from viral genomes ("viral genomic features") or based on similarity to three alternative sets of human gene transcripts ("human similarity features"): interferon-stimulated genes (ISGs), housekeeping genes, and all other genes. We hypothesized that viruses might optimally resemble ISGs since both tend to be expressed concomitantly in virus-infected cells. However, we also included non-    (16)). While dendrograms using raw   (18), while all lyssaviruses are assumed to be zoonotic (19).
The remaining viruses classified as high priority were from families not currently considered Sorex araneus coronavirus T14 - as being at least as, or more likely to be capable of infecting humans than SARS-CoV-2; these should be considered high priority for further research.  phenotypic models of zoonotic risk. The performance of our models, while imperfect, means that many potential zoonoses can be identified immediately after virus discovery and genome sequencing. Large-scale application of these models enables retrospective ranking-based prioritization of hundreds of recognized viruses as well as prospective ranking in parallel with virus discovery, spanning all RNA and DNA genome types (table S1). Importantly, our models predict baseline zoonotic potential, which ultimately will be modulated by ecological opportunities for emergence. Further, the societal impact of emergence will depend on capacity for human-to-human spread and the severity of human disease, which likely require additional non-genomic data to anticipate. Nonetheless, for both novel and recognized viruses, substantially reduced lists of candidate zoonoses heighten the feasibility of further ecological and virological characterisation.

Data
Although our primary interest was in zoonotic transmission, we trained models to predict the ability to infect humans in general, reasoning that patterns found in viruses predominately maintained by human-to-human transmission may contain genomic signals which also apply to zoonotic viruses. Data on the ability to infect humans were obtained by merging the data of (4) and (15), which contain species-level records of reported human infections. These datasets were supplemented by new literature searches for 15 species (7). In all cases, only viruses detected in humans by either PCR or sequencing were considered to have proven ability to infect humans. All viruses for which no such reports were found were considered to not infect humans, although we emphasise that many of these viruses are poorly characterised and could therefore be unrecognized or unreported zoonoses. We therefore expect our models to further improve as these and new viruses become better characterized.   contend that including currently unrecognized viruses is unlikely to improve the predictions of our models because: (a) most will be non-human infecting (an already over-represented class) and hence provide little additional information, (b) those which do infect humans will not generally be known to do so due to a lack of historic testing, adding misleading signals, and (c) a randomly chosen human-infecting virus would be ranked higher than a randomly chosen virus which has not been reported to infect humans. When a given feature set or combination of feature sets comprised < 125 features, all features were retained. This was the case for models trained using only taxonomy (7 features) or phylogenetic neighbourhoods (2 features). Final models were trained using reduced feature sets.
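The ranking interpretation of AUC mentioned above (how often a randomly chosen human-infecting virus is ranked above a randomly chosen virus not reported to infect humans) can be computed directly by comparing all such pairs. The following is a minimal sketch with a hypothetical function name and made-up scores, not the evaluation code used in the study; ties are counted as half a win, which is a common convention assumed here.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: predicted probabilities of infecting humans
    labels: True for viruses reported to infect humans, else False
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one virus of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # human-infecting virus ranked higher
            elif p == n:
                wins += 0.5   # tie counted as half (assumed convention)
    return wins / (len(positives) * len(negatives))


# Made-up predictions for five viruses
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [True, False, True, False, False]
auc = auc_by_ranking(scores, labels)
print(round(auc, 3))
```

Here five of the six positive-negative pairs are ordered correctly, so the value is 5/6. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.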
(27): $m$ is the length of the query sequence (in nucleotides), $n$ is the total number of nucleotides in the training set (i.e., the size of the database searched), and $S'$ is the bitscore for this particular alignment in the original blast search.

Feature importance and clustering
To assess the variability in feature importance while accounting for all viruses, feature importance was assessed across all 1000 models trained for bagging above. In each iteration, the influence of features was assessed using SHAP values, an approximation of Shapley values which describe the change in the predicted log odds of infecting humans attributable to each genome feature (16). The overall importance of each feature was calculated as the mean of absolute SHAP values across all viruses in the training set of a given iteration (28).

Because features tended to be highly correlated, we also report importance values for   These values were used to calculate the pairwise Euclidean distances between all virus species using version 2.1.0 of the cluster library in R (31). Viruses were then clustered using agglomerative hierarchical clustering, calculating distances between clusters as the mean distance between all points in the respective clusters (i.e. UPGMA clustering). To explore patterns common to viruses from each class, clustering was performed separately for known human infecting and other viruses.

To compare this explanation-based clustering with virus taxonomy, we also constructed a dendrogram based on taxonomic assignments as recorded in version 2018b of the ICTV master species list, using all taxonomic levels from phylum to subgenus. Since some levels of the ICTV taxonomy are not used consistently across all viruses, missing taxonomic levels were interpolated to ensure accurate representation of the underlying taxonomy. For example, for viruses which are not classified in a scheme which includes subfamilies, the next level downstream (genus) was repeated, thereby treating each genus as belonging to a distinct subfamily.
Categorical taxonomic assignments were used to calculate pairwise Gower distances between virus species (32), before performing agglomerative hierarchical clustering as described above. We also assessed the ability of underlying genome feature values to reconstruct virus taxonomy by performing hierarchical clustering on a Euclidean distance matrix calculated directly from all genomic features (i.e. unreferenced genome, ISG similarity, housekeeping gene similarity and remaining gene similarity feature sets). The similarity between dendrograms was assessed using the gamma correlation index of (17), as implemented in version 1.12.0 of the dendextend library in R (33). A null distribution for this statistic was calculated by randomly shuffling the labels (i.e., virus species names) of both dendrograms 1000 times. To assess the taxonomic depth at which dendrograms were concordant, the Fowlkes-Mallows index was calculated at each possible cut-point in the dendrograms being compared (34), again using the dendextend library. As before, a null distribution was generated by randomly shuffling the labels of both dendrograms 1000 times.
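The comparison at a single cut-point can be sketched as follows. This is an illustrative Python re-implementation under stated assumptions, not the dendextend code used in the analysis: each dendrogram is assumed to have already been cut into flat cluster labels (one label per virus, in the same virus order), and the function names, toy labels, and one-sided p-value are hypothetical choices for this example.

```python
import math
import random
from itertools import combinations


def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index between two flat clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same items")
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1      # pair co-clustered in both clusterings
        elif same_a:
            only_a += 1    # co-clustered in A only
        elif same_b:
            only_b += 1    # co-clustered in B only
    denom = math.sqrt((both + only_a) * (both + only_b))
    return both / denom if denom else 0.0


def permutation_null(labels_a, labels_b, n_perm=1000, seed=0):
    """Null distribution obtained by shuffling the labels of both clusterings."""
    rng = random.Random(seed)
    shuffled_a, shuffled_b = list(labels_a), list(labels_b)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled_a)
        rng.shuffle(shuffled_b)
        null.append(fowlkes_mallows(shuffled_a, shuffled_b))
    return null


# Made-up cluster assignments for nine viruses at one cut-point
clusters_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
clusters_b = [0, 0, 1, 1, 1, 1, 2, 2, 2]

observed = fowlkes_mallows(clusters_a, clusters_b)
null = permutation_null(clusters_a, clusters_b)
p_value = sum(v >= observed for v in null) / len(null)
print(round(observed, 3), p_value)
```

Repeating this calculation for every possible number of clusters gives the index as a function of cut-point, which is how concordance at different taxonomic depths can be read off.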

Ranking novel viruses
To illustrate the use of our models in practice, the best performing set of models (i.e. those trained using the best 125 features selected from among all genome feature-based feature sets) was used to predict the probability that novel viruses are able to infect humans.