Machine intelligence design of 2019-nCoV drugs

Wuhan coronavirus, called 2019-nCoV, is a newly emerged virus that infected more than 9692 people and leads to more than 213 fatalities by January 30, 2020. Currently, there is no effective treatment for this epidemic. However, the viral protease of a coronavirus is well-known to be essential for its replication and thus is an effective drug target. Fortunately, the sequence identity of the 2019-nCoV protease and that of severe-acute respiratory syndrome virus (SARS-CoV) is as high as 96.1%. We show that the protease inhibitor binding sites of 2019-nCoV and SARS-CoV are almost identical, which means all potential anti-SARS-CoV chemotherapies are also potential 2019-nCoV drugs. Here, we report a family of potential 2019-nCoV drugs generated by a machine intelligence-based generative network complex (GNC). The potential effectiveness of treating 2019-nCoV by using some existing HIV drugs is also analyzed.


Introduction
A cluster of pneumonia cases of unknown cause emerged with connections to Huanan South China Seafood Market in Wuhan city, Hubei Province of China in late December 2019. On January 7, 2020, a positive-sense, single-stranded RNA coronavirus (CoV) was identified as the causative agent, and the World Health Organization (WHO) named this novel coronavirus as 2019-nCoV on January 10. By January 30, a total of 9692 confirmed cases with 213 deaths had been reported in the world. As reported in, 1 in comparison with the basic reproduction number (R 0 ) of the severe acute respiratory syndrome (SARS) epidemic, which is about 4.91, R 0 of the 2019-nCoV epidemic was estimated as high as 6.47, which implies the severity of this widespread dissemination caused by 2019-nCoV. Under the current health emergency, many researchers around the world engaged in the investigation of the genetic and functional data of 2019-nCoV and compare to other coronaviruses to design proper infection control strategy 2, 3 and seek potential drugs that can prevent and/or cure this serious epidemic. [4][5][6][7] On January 10, the genome sequence of 2019-nCoV was first released on GenBank (accession MN908947) by Yong-Zhen Zhang's group at Fudan University. 8 Subsequently, phylogenetic analysis revealed that 2019-nCoV belonged to the Betacoronavirus genera and its closest whole-genome relatives are two SARS-like coronaviruses from bats, i.e., ZC45 and ZXC21, which shared about 89% sequence identity with 2019-nCoV. 6,7 Moreover, detailed sequence alignment data indicates that there is significant genetics distance between the 2019-nCoV and the human SARS-CoV. Therefore, the 2019-nCoV should be considered as a new type of bat coronavirus. However, this observation raised a worth-thinking issue whether the 2019-nCoV has the same infective mechanisms as the human SARS-CoV.
The spike protein (S-protein), which is known as the multi-functional molecular machine that mediates coronavirus entry into host cells 9 is discussed to answer the question mentioned above. According to the literature, 5,6 the 2019-nCoV and the SARS-CoV have a high amino acid sequence homology of the S-protein, which can result in severe human infections. Further structural similarity analysis showed and confirmed that the S-protein of 2019-nCoV uses the same human cell receptor angiotensin-converting enzyme 2 (ACE2) as SARS-CoV. 4,10 Therefore, similar approaches can be applied to prevent the S-protein binding with the receptor ACE2.
S-protein is one of the four proteins in coronavirus, which can be cleaved by a host cell furin-like protease (non-structural protein in coronavirus) into two functional units, S1 and S2. S1 makes up the large receptorbinding domain (RBD) to facilitates virus infection by binding to host receptors and S2 forms the stalk of the spike molecule. 11 Therefore, one possible way to control the infection is to seek protease inhibitors to prevent the S-protein of 2019-nCoV cleaved into S1. Another possible way is to seek specific drugs that have the ability to inhibit RBD binding to host receptors. However, consider the fact that the diversity of CoV is reflected in the variable S-proteins and the gene of S-protein is easy to mutate, [12][13][14][15] it is not economical to design a new drug to prevent the RBD binding to host receptor directly. Moreover, considering the high sequence identity between viral proteases of 2019-nCoV and SARS-CoV, seeking protease inhibitors to treat respiratory diseases induced by SARS and this novel pneumonia will be our first choice. Viral proteases are common targets in dealing with human viruses such as the HIV virus and hepatitis C virus. Protease inhibitors are remarkably effective in blocking the replication of coronavirus, including the SARS and the Middle East respiratory syndrome (MERS), providing a promising foundation for the development of anticoronaviral therapeutics. An overview of SARS-CoV 3CL protease inhibitors has been reported. 16 However, currently, there are no effective anti-SARS-CoV drugs available despite there are more than 3500 publications. It is true that there were only 8422 infected cases and 916 deaths reported for SARS-CoV, which might make the drug development unprofitable. The fact that we are unprepared for another coronaviral pandemic, like the Wuhan coronavirus outbreak, causes the world economy trillion dollars, not to mention the loss of many lives.
The present work makes use of a recently developed generative network complex (GNC) 17 to explore potential protease inhibitors for curing 2019-nCoV. We generate anticoronaviral therapeutic candidates for 2019-nCoV and evaluate their druggable properties. We also examine the potential of repurposing HIV protease inhibitors, Aluvia and Norvir, for 2019-nCoV.
properties. The next component is the MathPose model which is used to predict the three-dimensional (3D) structure information of the compounds selected by 2DFP-DNN. The bioactivities of those compounds are further estimated by the structure-based deep learning model named MathDL. 18 The druggable properties predicted by this last component of our GNC are used as an indicator to select the promising drug candidates. Figure 1: A schematic illustration of the generative network complex. SMILES strings (SSs) are encoded into latent space vectors via a gated recurrent neural network (GRU)-based encoder. These vectors are modified in the molecule generator to achieve desirable druggable properties, such as binding affinity, partition coefficient (LogP), similarity, etc., predicted by pre-trained deep neural networks (DNNs). The resulting drug-like molecules are translated into SSs by a GRU-based decoder. The physical properties of these SSs are validated by 2D fingerprint-based multitask DNNs. Promising drug candidates are fed into a MathPose unit to generate 3D structures, which are further validated by a mathematical deep learning (MathDL) center to select new drug candidates. 18

Autoencoder
The autoencoder, consisting of an encoder, a latent space, and a decoder, is used to encode a molecular SMILES string into a latent space representation X, which, after being further modified by a molecular generator, is translated back to a SMILES string by a decoder. Both the encoder and decoder are constructed by using gated recurrent neural networks (GRUs). GRUs can deal with the vanishing gradient problem occurred in recurrent neural network (RNN) models but are simpler than long-short-term memory (LSTM) models. GRUs are suitable for moderately complex sequences, such as small molecular SMILES strings. A pre-trained autoencoder model developed by Winter et al is adopted in the present work. 19 The latent space vector (X ∈ R n ) or molecular representation has the dimension of 512 (n = 512).

Molecule generator
In the present approach, novel molecules are initially designed at the molecule generator using a three-step procedure. In the first step, the latent space representation of a seed molecule is evaluated for their drug-like properties, such as binding affinities, solubility (logS), partition coefficient, similarity, etc., via pre-trained DNNs. The DNN consists of two hidden layers with 1024 neurons in each layer. In our second step, evaluation results ({ŷ i (X)} ) are compared with a set of target values ({y i0 }) via a loss function, where k i is a preselected weight coefficient for the ith property. The last step is to optimize the loss function with a gradient decent algorithm. The resulting new vector X is sent back to the pre-trained DNNs for reevaluation until the loss function is smaller than a given tolerance. Only the desirable molecule representation X is sent to the decoder to generate a SMILES string. Alternatively, a Monte Carlo procedure can be used to replace the gradient decent.

2D fingerprint-based predictor (2DFP)
New SMILES strings generated by the decoder is passed to 2D fingerprint-based predictors (2DFPs) to reevaluate druggable properties. 17 These predictors are pre-trained deep neural networks involving multiple hidden layers with hundreds or even thousands of neurons on each layer. During the training, weights on each layer are updated by backpropagation. The multitask deep learning architecture is often used to enhance small dataset predictions. The input 2D molecular fingerprints are generated from a combination of ECFP 20 and MACCS 21 fingerprints, yielding 2214 bits of features (2048 bits from ECFP and 166 bits from MACCS) in total. RDKit 22 is used for to translate SMILES strings into 2D fingerprints. The output drug properties include binding affinity, logP, and logS, etc.

MathDL for druggable property predictions
Our MathDL is a mathematical representation-based deep learning platform designed for predicting various druggable properties of 3D molecules. 18 Mathematical representations used in MathDL are algebraic topology (such as persistent homology), differential geometry, and graph theory-based algorithms developed over the past many years. These approaches were repeatedly validated by their top performance in free energy prediction and ranking at D3R Grand Challenges, a worldwide competition series in computer-aided drug design (https://drugdesigndata.org/about/grand-challenge). 18,23 More details about the mathematical representation of complex molecules can be found in a recent review. 24 A variety of datasets, particularly, PDBbind datasets, 25 were used in our training of deep learning networks. A further discussion of MathDL is given in our recent work. 17

MathPose for 3D structure prediction
MathPose is a 3D pose predictor that converts SMILES strings into 3D poses with references of target molecules. For a given SMILES string, about 1000 3D structures are generated by several common docking software tools, i.e., Autodock Vina, 26 GOLD, 27 and GLIDE. 28 Additionally, a selected set of known complexes is re-docked by the aforementioned three docking software packages to generate at 100 decoy complexes per input ligand as a machine learning training set. In this training set, the calculated root mean squared deviations (RMSDs) between the decoy and native structures are used as machine learning labels. Then, we set up MathDL models and apply them to pick up the top-ranked pose for the given ligand. The MathPose-generated top poses are fed to the MathDL for druggable property evaluation. Our MathPose was the top performer in D3R Grand Challenge 4 in predicting the poses of 24 beta-secretase 1 (BACE) binders. 18

Sequence identity analysis
The sequence identity is defined as the percentage of characters which match exactly between two different sequences. The sequence identities between 2019-nCoV protease and some other coronaviral proteases are presented in Table 1. It is seen that 2019-nCoV protease is very close to SARS-CoV protease, but is distinguished from other proteases. Clearly, 2019-nCoV has a strong genetic relationship with SARS-CoV. Additionally, the available experimental data of SARS-CoV protease inhibitors can be used as the training set to generate new inhibitors of 2019-nCoV protease.

Structure similarity analysis
The 2019-nCoV protease (PDB ID 6lu7) and SARS-CoV 3CL protease (PDB ID: 2gx4) have an excellent similarity. As shown in Fig. 2, two crystal structures are essentially identical to each other. Particularly, the RMSD of two crystal structures at the binding site is 0.53 Å. When we try to carry a homology modeling of 2019-nCoV protease structure from its sequence using SARS-CoV 3CL protease (PDB ID: 2gx4) as a model, the resulting 2019-nCoV protease homology structure has an RMSD of 0.9 Å (or 0.2 Å at the binding site region) with its crystal structure 6lu7. The high structural similarity between the two proteases suggests that anti-SARS-CoV chemicals can be equally effective for the treatment of 2019-nCoV.

SARS-CoV protease inhibitor dataset
ChEMBL, 29 an open database which brings chemical, bioactivity, and genomic data together to translate genomic information into effective new drugs is employed to construct our 2019-nCoV training set. Considering the high sequence identity between viral proteases of 2019-nCoV and SARS-CoV, we take the protease of SARS-CoV as the input target in ChEMBL and a total 115 ChEMBL IDs of the target can be found. Therefore, our 2019-nCoV training set is built up by 115 SARS-CoV protease inhibitors. Figure 4 describes the distribution of experimental ∆G values of 2019-nCoV training set. It can be seen that the experimental ∆G ranges from −10.0 kcal/mol to 7.5 kcal/mol, and most training samples have the experimental ∆G located in the range of [−10, −5] kcal/mol. Followed by the second law of thermodynamics, the more negative ∆G can result in a more spontaneous binding process. Figure 3 depicts the top five anti-SARS CoV compounds and their banding affinities. 29 In this work, 115 SARS-nCoV protease inhibitors from ChEMBL are used as the 2019-nCoV training set. A collection of these 115 compounds is given in the Supplementary material.

Binding affinity training sets
The PDBbind database is a yearly updated collection of experimentally measured binding affinity data (Kd, Ki, and IC50) for the protein-ligand complexes deposited in the Protein Data Bank (PDB). PDBbind refined set contains high-quality X-ray crystal structures of protein-ligand complexes and associated binding affinities. 25 The refined set is selected by filtering through binding data, crystal structures, as well as the nature of the complexes. 25 The PDBbind 2018 refined set of 4463 complexes is employed as the major part of our binding affinity training set. Additionally, taring data are collected from recent D3R Grand Challenges   (https://drugdesigndata.org/about/grand-challenge) and a few hundreds of Pfizer molecules, which, however, mostly unrelated to the present coronaviral protease.

Binding affinity analysis
Binding free energies are computed with four methods, namely, the latent space binding predictor (LS-BP), the 2D fingerprint predictor (2DFP), a 3D deep learning model trained with the mixture all datasets, including the dataset for coronaviral protease (denoted as "3DALL"), and finally, a 3D deep learning multitask model trained with the dataset for coronaviral protease as a separated task (denoted as 3DMT).  Table 2 as references. Table 2 lists a few other druggable properties, including partition coefficient (logP), solubility (logS), and synthesizability.
The top-ranking candidate of our generated molecules is MSU3298 (see Figure 5). Its predicted binding affinity to the nCoV-2019 protease is -10.56 kcal/mol, which is higher than that of the best candidate CHEMBL222234 (-10.02 kcal/mol). The high binding affinity is due to the existence of many hydrogen bond acceptors and donors and forming a strong hydrogen bond network with nCoV-2019 protease. For example, the strongest hydrogen 6 . CC-BY-NC 4.0 International license (which was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint this version posted February 4, 2020.     affinity is also due to the strong interaction at the tail of the molecule. One Cl atom on the quaternary ring forms a hydrogen bond with the side chain of residue Arg188. One O atom in this tail also has a hydrogen bond with the main chain of residue Tyr54. In the head of the molecules, the two Cl atoms interact with the methanethiol of residue Cys145 and the main-chain amino of residue Glu166. These hydrogen bonds promise a strong binding to 2019-nCoV protease binding site. The third molecule is MSU3245 (see Figure 7) with a binding affinity -9.55 kcal/mol. The strongest hydrogen bonds between this molecule and the protease are the Cl atom on the benzene ring of the molecule and the side-chain hydroxyl of the residue Ser144. Additionally, the Cl atom of the ternary ring and the methanethiol in residue Cys144.

Solubility
Aqueous solubility, a chemical property denoted by its logarithm value logS, reveals how a solute dissolves in a solvent which will affect absorption, distribution, metabolism, and elimination processes (ADME) in drug discovery and other pharmaceutical fields. 30 It is part of the pharmacokinetics studies. In this work, we calculate the logS values for all of the new potential anti-2019-nCoV drugs using our 2DFP-based logS predictor. Table 2 lists top 15 anti-2019-nCoV candidate molecules with their druggable properties. It is seen that the smallest logS is -6.44 and the largest value is between −5.000 and −1.000. However, only two potential anti-2019-nCoV drugs (i.e., MSU2313 and MSU3289) in Table 2 have the logS values in the range of [−5.00, −1.00], while the others have a little bit higher value of logS. One possible reason is that our 2DFP-based calculation of logS may have a systematic error. Another possible explanation is that our anti-2019-nCoV drug candidates may not be absorbed through membranes as easily as some other drugs on the market. In our future study, a stronger logS constraint will be imposed in our molecule generator.

Partition coefficient
The partition coefficient, which measures how hydrophilic or hydrophobic a chemical substance is, is defined as the ratio of concentrations of a solute in a mixture of two immersible solvents at equilibrium. 33 The logarithm of partition coefficient, denoted logP, is a well-known coefficient which plays an essential role in governing kinetic and dynamic aspects of drug action. In this paper, we will employ an Open-Source Cheminformatics Software Rdkit 22 to calculate logP values of our 2019-nCoV drug candidates to evaluate the reliability of the potential 2019-nCoV drugs we predicted. All of the logP values of all predicted molecules can be found in the Supplementary Materials. While the logP values of the predicted top 15 drug candidates are presented in Table 2. From the table, it can be observed that most 2019-nCoV drug candidates we predicted have the logP value smaller than 5, which matches one of the rules in "Lipinski's rule of five". 34 Moreover, the ritonavir, an HIV protease inhibitor already on the market, has a predicted logP = 5.91, which shows that our potential drugs with logP values slightly larger than 5 can still be considered as druggable molecules.
Moreover, consider the fact that logP also has the ability to measure the solubility of the solute in liquids, we are more likely to say that our logS calculation method in subsection 4.1 is not as precise as expected. These issues can be addressed by placing a stronger logS constraint in the latent space.

11
. CC-BY-NC 4.0 International license (which was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint this version posted February 4, 2020. . https://doi.org/10.1101/2020.01.30.927889 doi: bioRxiv preprint

Synthesizability
Although we have the chemical structures of possible anti-2019-nCoV drugs, it is essential for us to estimate the feasibility of synthesis (synthetic accessibility) of these molecules. The synthetic accessibility score (SAscore) between 1 (easy to make) and 10 (very difficult to make) is described in. 35 The SAscores of our drug candidates are calculated by the Rdkit as the evaluation of the molecules' synthesizability. The last column in the Table 2 lists the SAscores of our Top 15 anti-2019-nCoV molecules. The molecule ID: MSU3519 has the highest SAscore equals to 4.69, which reveals that most of our potential anti-2019-nCoV molecules are quite easy to synthesize.

Effectiveness of some anti-HIV/AIDS drugs for 2019-nCoV
Lopinavir is an antiretroviral medication used to inhibit HIV/AIDS viral protease. It is often used as a fixed-dose combination with another protease inhibitor, ritonavir, sold under the name Kaletra or Aluvia. Ritonavir, sold under the trade name Norvir, is another antiretroviral medication. Its combination with Lopinavir is known as highly active antiretroviral therapy (HAART). Although there is no tractable clinical evidence, Kaletra or Aluvia has been proposed as a potential anticoronavirus drug for 2019-nCoV. The possibility of repurposing some HIV drugs for SARS-CoV treatment has also studied in the literature. 16 It is important to evaluate their binding affinities, which are obtained with two ligand-based methods (i.e., LS-BP and 2DFP) and two 3D models (3DALL and 3DMT). To carry out 3D model predictions, we dock them to the 2019-nCoV protease inhibition site. The resulting complexes are optimized with molecular dynamics and then evaluated by 3DALL and 3DMT. Table 1 shows the low sequence identity between HIV viral protease and 2019-nCoV protease, which might suggest the limited potential for repurposing Aluvia and Norvir for 2019-nCoV treatment. For Lopinavir, our LS-BP and 2DFP predicted the binding affinities of -5.66 kcal/mol and -5.54 kcal/mol, respectively. For Ritonavir, similar low binding affinities of -5.14 kcal/mol and -4.96 kcal/mol were predicted by our LS-BP and 2DFP, respectively. However, our 3D model 3DALL predicted better binding affinities, i.e., -7.78 kcal/mol and -8.44 kcal/mol for Lopinavir and Ritonavir, respectively. The other 3D model, 3DMT, also predicted moderately high binding affinities of -8.13 kcal/mol and -8.07 kcal/mol for Lopinavir and Ritonavir, respectively. Considering the fact that the small training set for LS-BP and 2DFP models is very small, the results predicted by 3D models are more reliable. Figures 20 and 21 indicate that these drugs have reasonable dock poses with 2019-nCoV protease. Therefore, HIV drugs Kaletra (or Aluvia) and Norvir might indeed have a moderate effect in the treatment of 2019-nCoV. However, Many new compounds generated by our GNC appear to have better druggable properties than these HIV inhibitors do.  epidemic. Although we know quite a little about 2019-nCoV, it is fortunate that the sequence identity of the 2019-nCoV protease and that of severe acute respiratory syndrome virus (SARS-CoV) is as high as 96.1%. In this work, we show that the protease inhibitor binding sites of 2019-nCoV and SARS-CoV are almost identical, which provides a foundation for us to hypothesize that all potential anti-SARS-CoV chemotherapies are also effective anti-2019-CoV molecules. Additionally, we employed a recently developed generative network complex (GNC) to seek potential protease inhibitors for effective treatment of pneumonia caused by 2019-nCoV. Two datasets 13 . CC-BY-NC 4.0 International license (which was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint this version posted February 4, 2020. . https://doi.org/10.1101/2020.01.30.927889 doi: bioRxiv preprint are utilized in this work. One is a SARS-CoV protease inhibitor dataset, which is constructed by collecting 115 SRAS-CoV inhibitors from open database ChEMBL. The other dataset is a binding affinity training set mainly containing the PDBbind refined set. Our GNC model predicts over 8000 potential anti-2019-nCoV drugs which are evaluated by a latent space binding predictor (LS-BP) and a 2D fingerprint predictor (2DFP). Promising drug candidates are further evaluated by two 3D deep learning models trained with all the training sets together, including the dataset for coronaviral protease (3DALL), and the 3D deep learning multitask model trained with the dataset for coronaviral protease as a separated task (3DMT). Furthermore, we choose 15 potential anti-2019-nCoV drugs to analyze partition coefficient (logP), solubility (logS), and synthetic accessibility score (SAscore) according to binding affinity ranking computed by the 3DALL model. The reasonable logP, logS, and SAscore show that our top 15 anti-2019-nCoV drug candidates are potentially effective for inhibiting 2019-nCoV. Finally, the effectiveness of some anti-HIV/AIDS drugs for treating 2019-nCoV is analyzed. Although HIV drugs Kaletra (or Aluvia) and Norvir might indeed have a moderate effect in the treatment of 2019-nCoV, the analysis of these anti-HIV/AIDS drugs together with our top 15 anti-2019-nCoV molecules shows that the new compounds generated by our GNC appear to have better druggable properties than these HIV inhibitors do.
SupplementaryMaterial-2.csv: The SMILES strings, experimental binding affinities of 115 potential SARS inhibitors from ChEMBL. They are used as a training set for our GNC.