SSnet: A Secondary Structure based End-to-End Learning Model for Protein-Ligand Interaction Prediction

Computational prediction of bioactivity has become a critical aspect of modern drug discovery, as it mitigates the cost, time, and resources required to find and screen new compounds. Deep Neural Networks (DNN) have recently shown excellent performance in modeling Protein-Ligand Interaction (PLI). However, DNNs are only effective when physically sound descriptions of ligands and proteins are fed into the network for further processing. Furthermore, previous research has not incorporated the secondary structure of the protein in a meaningful manner. In this work, we utilize secondary structure information of the protein, extracted as the curvature and torsion of the protein backbone, to predict PLI. We demonstrate how our model outperforms


Introduction
The interaction of a protein with small molecules is a complex mechanism controlling many fundamental operations in a biological system. [1][2][3][4][5][6][7][8][9][10][11][12] These interactions are governed by a multitude of factors [10][11][12] including hydrogen bonding, 1-5 π-interactions, 6-8 hydrophobicity, 9 etc. Experimental validation of Protein-Ligand Interaction (PLI) is the state-of-the-art method; however, it is time-consuming and expensive. Computational methods can save considerable time and resources, but due to the complex nature of PLI its prediction remains a challenging computational enterprise. It is therefore imperative to develop computational methods to predict PLI. Reliable PLI predictions could significantly reduce the discovery time for new treatments, eliminate toxic drug candidates, and efficiently guide medicinal chemistry efforts. 13 Traditional PLI determination relies on high-throughput screening, an experimental technique with high cost and low efficiency. 14 Virtual screening (VS) accelerates the PLI process while greatly reducing time and resources. Broadly, VS can be divided into two major categories: Ligand Based Virtual Screening (LBVS) and Structure Based Virtual Screening (SBVS). 15 LBVS applies known sets of ligands to a target of interest; its capability to find novel chemotypes is therefore limited. SBVS uses the 3D structure of a given target and is therefore a better choice for the discovery of novel active compounds. 16 However, SBVS shows somewhat poor performance, sometimes failing to distinguish active from non-active compounds. 17 Over the last few decades, many classical techniques such as force field, empirical, and knowledge based 18 PLI predictions have been applied, however often showing low performance and, in some cases, even discrepancies when compared with experimental bioactivities. 19 Machine learning (ML) and Deep learning (DL) approaches have recently received attention in this field.
Various reviews summarize the application of ML/DL in drug design and discovery. [20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35] Machine learning based PLI prediction has been developed from a chemogenomics perspective. 36 Most protein structure based models (ML/DL) for PLI prediction achieve low accuracy because i) high resolution protein-ligand pairs for training are mostly absent, and ii) the 3D grid for the target is a big and sparse matrix, which hinders ML/DL models from learning and predicting PLI. In this work, for the first time, we propose a secondary structure based model for proteins with a 1D vector, contrary to 3D based representations. 39 The 1D representation is based on the curvature and torsion of the protein backbone. Mathematically, curvature and torsion are sufficient to reasonably represent the 3D structure of a protein. 43 Tremendous work has been carried out to represent ligands. Rafael et al. 44 created a model to generate a Continuous Latent Space (CLP) from sparse Simplified Molecular-Input Line-Entry System (SMILES) strings (i.e., a string representation of a molecule) based on a variational autoencoder, similar to word embedding. [45][46][47] Esben et al. 48 proposed an improved version of the latent space autoencoder for de novo molecule generation. Scarselli et al. 49 proposed a Graph Neural Network (GNN) to describe molecules. David et al. 50 developed Extended-Connectivity Fingerprints (ECF), which include the presence of substructures (and therefore also stereochemical information) to represent molecules. Sereina et al. 51 proposed a fingerprint based on substructures and their similarity (Avalon).
End-to-end learning, a powerful ML/DL technique for drug discovery and development, has gained interest in recent years. 52 The end-to-end learning technique involves i) embedding inputs into lower dimensions, ii) formulating various neural networks depending on the data available, and iii) using backpropagation over the whole architecture to minimize the loss and update the weights. In this work we utilize such an architecture, where the proteins and the ligands are transformed into lower dimensions with the help of fully connected dense networks. We coin this neural network based end-to-end learning model SSnet. A general overview of the SSnet model is shown as Figure 1 of the supporting information.
We analyzed the SSnet model by utilizing the Grad-CAM method. 53 Grad-CAM is a useful technique to visualize, as a heatmap, which parts of the input feature maximally excite the network. In other words, it shows the data points in the input feature that are most responsible for the prediction. We show that our model learns the important residues in the protein which maximally interact with the ligand.
In the following, we first demonstrate how the secondary structure of proteins can be used in ML/DL. We then discuss the representation of ligands, followed by the introduction of the SSnet model, possible evaluation criteria, and its merits and demerits. We then explain the datasets used in this work, analyze the performance and outcomes of SSnet in the Results and Discussion section, and unbox the SSnet model by visualizing heatmaps of the proteins.
Finally, we summarize and conclude our work and provide a future perspective.

Representation of Proteins
Protein structures exhibit a large conformational variety which has traditionally been modeled through complex combinations of primary, secondary, tertiary, and quaternary levels.
Many automated and manual sorting databases like SCOP, 54 CATH, 55 DALI, 56 and programs like DSSP, 57 STRIDE, 58 DEFINE, 59 and KAKSI 60 have provided protein classifications based on the secondary structure. However, these classifications are often conflicting. 61 A more promising approach has been introduced by Ranganathan et al. 43 based on Frenet-Serret frame and coordinates to classify secondary structure elements in proteins.
In this approach a protein is represented by the α carbons of the backbone (CA atoms) because the backbone defines a unique and recognizable structure, especially for protein categorization. 62 Moreover, a significant amount of information about the protein is embedded in secondary structure elements such as helices, β sheets, hairpins, coils, and turns.
Therefore it is imperative to incorporate the secondary structure of the protein when representing its features for machine learning algorithms. Otherwise, the algorithms will be blind to interactions dependent on the secondary structure.
The secondary structure information can be retrieved from a smooth curve generated by a cubic spline fit of the CA atoms. Figure 1 shows the arc length s, scalar curvature κ, and scalar torsion τ which define the 3D curve r(s). The scalar curvature is expressed as a function of the arc length s,

κ(s) = |r''(s)| (1)

and the scalar torsion as

τ(s) = [r'(s), r''(s), r'''(s)] / |r''(s)|² (2)

where | · | is the norm and [ ·, ·, · ] is the vector triple product. We hypothesized that, using convolution, varying sized filters may be excellent pattern matching methods for discerning structure from these decomposition plots. After conversion into the Frenet-Serret frame and the calculation of curvature κ and torsion τ, the κ and τ data (i.e., decomposition data) are fed into the neural network. We denote this input as a 2D matrix, X(0), where each column represents a unique residue and the rows correspond to the curvature and torsion. The first layer is a branch convolution with varying window sizes. That is, each branch is a convolution with a filter of differing length. We perform this operation so that patterns of varying lengths in the decomposition plot can be recognized by the neural network. Each branch is then fed to more convolutions of the same window size.

Representation of Ligands
This allows the network to recognize more intricate patterns in X(0) that might be more difficult to recognize with a single convolution. The outputs of these convolutional branches are concatenated, pooled over the length of the sequence, and fed to a fully connected dense layer. The rightmost upper branch of Figure 3 shows a ligand vector which is generated and fed to a fully connected dense layer. The output of this layer is typically referred to as an embedding. Intuitively, this embedding is a reduced dimensionality representation of the protein or ligand. The outputs of the protein embedding and the ligand embedding are then concatenated and fed to further dense layers to predict the PLI.
The convolutional network in this work applies filter functions over the protein vector. To define the convolution operation more intuitively, we first define a reshaping operation:

c_i = vec(X(0)_{∀row, i:i+K})

where the flattening operation reshapes each row of X(0) from indices i to i+K into a single column vector c_i. This process is also referred to as vectorization. The size of the filter is then of length K. We define the convolution operation as:

X(1)_{∀row, col=i} = f(W(0)conv c_i + b(0)conv)

where f is the rectified linear unit (ReLU), W(0)conv is the weight matrix, and b(0)conv is the bias vector. This operation fills in the columns of the convolution output X(1) (also called the activation or feature map). Each row of W(0)conv is considered a different filter and each row of X(1) is the convolutional output of that filter.
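The windowed reshaping and convolution described above can be sketched in a few lines of pure Python. The filter weights and sizes below are made-up illustrative values, not the trained SSnet parameters.

```python
# Sketch of the windowed convolution: X0 has one column per residue and two
# rows (curvature, torsion); each window of K residues is flattened into a
# vector c_i and mapped through ReLU(W c_i + b), one output row per filter.

def relu(x):
    return x if x > 0.0 else 0.0

def flatten_window(X0, i, K):
    """c_i: entries of every row of X0 from column i to i+K-1, as one vector."""
    return [X0[r][j] for r in range(len(X0)) for j in range(i, i + K)]

def conv1d(X0, W, b, K):
    """Returns X1 with one row per filter and one column per window position."""
    n_cols = len(X0[0]) - K + 1
    X1 = [[0.0] * n_cols for _ in W]
    for i in range(n_cols):
        c = flatten_window(X0, i, K)
        for f, (w_row, bias) in enumerate(zip(W, b)):
            X1[f][i] = relu(sum(w * x for w, x in zip(w_row, c)) + bias)
    return X1
```

With a single all-ones filter of size K = 2, each output entry is simply the sum of the curvature and torsion values inside the window, which makes the sliding-window behavior easy to verify by hand.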
These convolutions can be repeated such that the n-th activation is computed as:

X(n)_{∀row, col=i} = f(W(n-1)conv c(n-1)_i + b(n-1)conv)

In our SSnet model we use four different branches with filter sizes of K = 5, 10, 15, and 30. After the final convolution we apply a maximum operation along the rows of X(N)_K. This is typically referred to as a Global Max Pooling layer in neural networks and is repeated R times, once for each row in X(N)_K:

(d_K)_r = max_col X(N)_{K, row=r}

where d_K is a length R column vector regardless of the number of columns in the latent space X(N)_K. This maximum operation, while important, has the effect of eliminating much of the information in the latent space. To better understand the latent space, we can further process X(N)_K to capture how samples are distributed. For example, a simple operation is to define another column vector v that denotes the total variation in each row of the latent space:

(v_K)_r = Σ_col |X(N)_{K, row=r, col+1} − X(N)_{K, row=r, col}|

The concatenation of the vectors d and v helps elucidate how the samples are distributed in the latent space. As such, we can use these concatenated vectors as input to a fully connected dense layer that learns to interpret the latent space. This output is referred to as the embedding of the protein, y_prot, and is computed as

y_prot = f(W_prot [d; v] + b_prot)

where W_prot is the learned weight matrix and b_prot is the bias vector of a fully connected network.
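The pooling step described above can be sketched as follows: d collapses each latent row by a global max, v records each row's total variation, and the two are concatenated before the dense layer. This is illustrative code, not the SSnet implementation.

```python
# Global max pooling plus total variation over a latent map X
# (one row per filter, one column per sequence position).

def global_max_pool(X):
    """d: the maximum of each row, independent of the number of columns."""
    return [max(row) for row in X]

def total_variation(X):
    """v: sum of absolute differences between adjacent columns, per row."""
    return [sum(abs(row[i + 1] - row[i]) for i in range(len(row) - 1))
            for row in X]

def pool_latent(X):
    """Concatenation [d; v], the input to the protein embedding layer."""
    return global_max_pool(X) + total_variation(X)
```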
The method described above is similar to a technique recently used in speaker verification systems, where the window sizes need to be dynamic because the length of the audio snippet is unknown. [64][65][66] In speech systems, the latent space is collapsed via mean and standard deviation operations and the embeddings produced by these operations are typically referred to as D-Vectors 64 or X-Vectors. 65 After embedding the protein and the ligand, we concatenate the vectors together and feed them into the final neural network branch, resulting in a prediction of binding, ŷ, which is expected to be closer to "0" for ligands and proteins that do not bind and closer to "1" for proteins and ligands that do bind. This final branch consists of two layers:

h = f(W_1 [y_prot; y_lig] + b_1)

ŷ = σ(W_2 h + b_2)

where σ refers to a sigmoid function that maps the output to [0, 1]. If we denote the ground truth binding as y, which is either 0 or 1, and denote all the parameters inside the network as W, then the loss function for the SSnet model can be defined as the binary cross entropy:

L(W) = −(1/M) Σ_{m=1}^{M} [y_m log ŷ_m + (1 − y_m) log(1 − ŷ_m)]

where M is the number of samples in the dataset.
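The sigmoid output and binary cross entropy loss described above amount to the following minimal sketch of the standard definitions (not SSnet's training code).

```python
import math

def sigmoid(z):
    """Maps a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_pred):
    """Mean of -[y log(yhat) + (1 - y) log(1 - yhat)] over M samples."""
    m = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / m
```

A useful check: a score of 0 maps to a prediction of exactly 0.5, and predicting 0.5 everywhere yields a loss of log 2 per sample regardless of the labels.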

Grad-CAM method for heatmap generation
A neural network generally exhibits a large number of weights to be optimized so that complex information can be learned; however, some of this information could be irrelevant to a prediction task. For example, consider the task of identifying whether a certain image contains a horse. If all horse images also contain date information and images without horses do not, the machine will quickly learn to detect the date rather than the target object (a horse in this case). It is therefore imperative to verify what a neural network considers "influential" for classification after training. Ramprasaath et al. 53 proposed Gradient-weighted Class Activation Mapping (Grad-CAM) for this purpose. Grad-CAM is computed by taking the gradient weight α_k for all channels in a convolutional layer as

α_k = (1/Z) Σ_i ∂ŷ/∂X(N)_{k,i}

where k is the row in the final convolutional layer, Z is a normalization term, X(N) is the activation of the final convolutional layer, and ŷ is the final layer output. The heatmap S is then computed as the weighted sum of the final layer activations:

S = Σ_k α_k X(N)_{row=k}

This heatmap S specifies the portions of the input sequence that are most responsible for a particular class activation. For each convolutional branch, we can apply this procedure to understand which portions of the input decomposition sequence contribute the most, according to each filter size K. In this way, we can map the most influential portions onto locations on the backbone of the protein. To the best of our knowledge, this procedure has never been applied to protein (or ligand) structures, as Grad-CAM has rarely been applied outside of image processing.
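The Grad-CAM computation described above can be sketched as follows. For simplicity the gradients ∂ŷ/∂X are assumed to be supplied directly rather than obtained by backpropagation, and the weighted sum is clipped at zero, the ReLU used in the original Grad-CAM formulation.

```python
# Sketch of Grad-CAM over a 1D activation map: alpha_k averages the gradient
# of the output with respect to row k, and the heatmap is the alpha-weighted
# sum of the rows, clipped at zero.

def grad_cam_weights(grads):
    """alpha_k = (1/Z) * sum_i d(yhat)/dX[k][i], with Z the row length."""
    return [sum(row) / len(row) for row in grads]

def grad_cam_heatmap(alphas, activations):
    """S[i] = ReLU(sum_k alpha_k * X[k][i]): one score per sequence position."""
    n = len(activations[0])
    s = [sum(a * act[i] for a, act in zip(alphas, activations)) for i in range(n)]
    return [max(0.0, x) for x in s]
```

Positions with high heatmap values can then be mapped back onto residues of the protein backbone.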

Evaluation criteria
The evaluation criterion for PLI is generally the area under the receiver operating characteristic curve (AUC). 67 The receiver operating characteristic curve is the plot of true positive rate vs. false positive rate, and the area under this curve is the AUC. An AUC greater than 0.5 suggests that the model performs better than chance. However, AUC suffers from the early recognition problem (a high positive rate for the highest ranked ligands, which are assayed first) and may therefore incorrectly judge a model. The Enrichment Factor 68,69 (EF) and the Boltzmann-Enhanced Discrimination of Receiver Operating Characteristic (BEDROC) take the early recognition problem into account. EF is defined as

EF_X% = (Compounds_selected / N_X%) / (Compounds_total / N_total) (12)

where N_X% is the number of ligands in the top X% of the ranked ligands. EF thus estimates, relative to a random distribution, how many more actives can be found within the early recognition threshold. BEDROC is interpreted as the probability that an active is ranked before a ligand drawn from a random probability distribution in an ordered list. An exponent factor α determines the shape of the distribution. In this work, we chose α = 20, following the benchmarking of fingerprints for ligand-based virtual screening. 70
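The enrichment factor of Eq. (12) can be computed as in the sketch below. This is an illustrative implementation; the exact rounding of the top X% cut and the handling of tied scores are assumptions, not taken from the paper.

```python
# Enrichment factor: how much richer in actives the top X% of the ranking is
# compared to the overall active rate. labels are 1 (active) / 0 (decoy).

def enrichment_factor(scores, labels, x_percent):
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_top = max(1, round(len(ranked) * x_percent / 100.0))  # size of top X%
    actives_top = sum(lab for _, lab in ranked[:n_top])
    actives_total = sum(labels)
    return (actives_top / n_top) / (actives_total / len(labels))
```

For example, if both actives in a set of 10 compounds land in the top 20% of the ranking, the active rate there is 1.0 against a base rate of 0.2, giving EF_20% = 5.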

Datasets
The dataset used for the evaluation of PLI prediction models is of critical importance. The dataset should have credible positive and negative samples (i.e., protein-ligand pairs that interact and protein-ligand pairs that do not interact, respectively). However, most datasets currently applied for PLI prediction use randomly generated negative samples, 71,72 which creates noise in the data, as there might be false negatives (i.e., pairs labeled as non-interacting that actually interact).
The highly credible negative sample datasets human and C.elegans were created by Liu et al. 73

Results and discussion
Our model takes the curvature κ and torsion τ for proteins and SMILES strings for ligands as input. The SMILES string is further converted into molecular descriptors utilizing the methods ECF, 50 Avalon, 51 GNN 49 and CLP. 44 DNNs with convolutional neural networks (CNN) have a large number of weights to optimize and therefore require a large number of data instances to learn. However, in the human and C.elegans datasets the number of instances is very small and the proposed model (SSnet) therefore overfits (shown as Figure 2 in the supporting information). To overcome this problem we omitted the convolution layer and directly fed the curvature and torsion of proteins to a fully connected dense layer (similar to ligands) for the human and C.elegans datasets. Table 1 shows the performance of the SSnet model on the human dataset when different ligand descriptors were used, for balanced (1:1 ratio of positive to negative samples) and unbalanced (1:3 and 1:5) datasets. This suggests that the SSnet model is robust and is able to learn useful information about the protein and ligand pairs. If the model failed to learn robust features, the classifier would likely simply predict the majority class, with a correspondingly low AUC score. We do not observe this.
We further compared our model with PLI specific methods: BLM, 79 RLS-avg and RLS-Kron, 80 KBMF2K-classifier, KBMF2K-regression 81 and GNN-CNN, 40 as shown in Figure 4a (performed in the same experimental setting as Liu et al. 73 ). It is important to note that BLM, RLS-avg, RLS-Kron, KBMF2K-classifier, and KBMF2K-regression are modeled on properties such as a chemical structure similarity matrix, a protein sequence similarity matrix, and a PLI matrix. Despite such preorganized inputs, SSnet was able to outperform them in terms of AUC. The superior description of proteins in the SSnet model also allowed it to outperform the state-of-the-art model GNN-CNN. In addition, we evaluated our model on the DUD-E dataset, as shown in Figure 4b. The best performing molecular descriptor, GNN, was employed for ligands and the branch convolutional neural network was employed for proteins. The performance of the other molecular descriptors is shown in Table 1 of the supporting information. As described in the evaluation criteria section, accuracy and AUC are not the best metrics for PLI prediction evaluation. The receiver operating characteristic (ROC) curve on the DUD-E dataset for each single protein in the test set (30 proteins) is shown as Figure 3 in the supporting information. Figure 5

Latent space for proteins
To further probe the intrinsic mechanism of our model we analysed the outputs of the global max pooling layer (GMP) (protein concatenate layer in Figure 3). The t-distributed Stochastic Neighbor Embedding (t-SNE) of the GMP is shown as Figure 4 in the supporting information. t-SNE is a method to embed high-dimensional points in low dimensions while retaining similarities between points. In this way similar data points form clusters and are distinguishable from other data points. We tested the proteins in the test set. We note that glyburide is not in the known binding site of the protein but slightly away from it, acting as an allosteric inhibitor. It has been shown that the presence of glyburide overcomes the toxicity related to drugs binding at the active site of this protein. 87 The fact that the SSnet model highlights the portion of the protein where glyburide could bind (not the active site) represents one of the various applications that can be carried out with our model. These results also suggest that the SSnet model learns the relevant information from the features for PLI prediction.

Conclusion
In this work we have approached the prediction of protein-ligand interaction via end-to-end learning based on the secondary structure of proteins and the molecular description of ligands. The protein's secondary structure is encoded as the curvature and torsion of the protein backbone. The curvature and torsion comprise 1D data and therefore provide compact information that the machine finds easier to learn. In comparison, 3D or 2D feature representations of proteins provide sparse information to the machine, and models based on them therefore show poor performance. The curvature and torsion uniquely determine the 3D structure of the backbone. It is important to note that SSnet is an ML/DL based method and is therefore much faster than traditional methods such as Vina 78 or Smina. 83 The SSnet model utilizes secondary structure information of the protein and therefore does not necessarily require high resolution structural information. For validation and reproducibility, all code developed in this work is publicly accessible at CATCO-Github.
Our study suggests that end-to-end learning models based on the secondary structure of proteins have great potential in bioinformatics, not confined to protein-ligand prediction: they can be extended to various biological studies such as protein-protein, protein-DNA, and protein-RNA interactions. We leave these explorations of the SSnet model for future work.

Associated content
Model overview, training loss on the human dataset, results obtained by the SSnet model for all molecular descriptors, statistical results on the DUD-E dataset, and Grad-CAM analysis of various proteins.