PREDTAP: a system for prediction of peptide binding to the human transporter associated with antigen processing

Background The transporter associated with antigen processing (TAP) is a critical component of the major histocompatibility complex (MHC) class I antigen processing and presentation pathway. TAP transports antigenic peptides into the endoplasmic reticulum where it loads them into the binding groove of MHC class I molecules. Because peptides must first be transported by TAP in order to be presented on MHC class I, TAP binding preferences should impact significantly on T-cell epitope selection. Description PREDTAP is a computational system that predicts peptide binding to human TAP. It uses artificial neural networks and hidden Markov models as predictive engines. Extensive testing was performed to validate the prediction models. The results showed that PREDTAP was both sensitive and specific and had good predictive ability (area under the receiver operating characteristic curve Aroc > 0.85). Conclusion PREDTAP can be integrated with prediction systems for MHC class I binding peptides for improved performance of in silico prediction of T-cell epitopes. PREDTAP is available for public use at [1].


Background
Peptides that bind major histocompatibility complex (MHC) class I molecules serve as recognition targets for cytotoxic CD8+ T cells (CTLs). The major function of CTLs is recognition and destruction of infected (e.g. by viruses, bacteria, parasites or fungi), mutated (e.g. cancer), or foreign (e.g. transplants) cells. CTLs recognize short antigenic peptides (T-cell epitopes) presented by MHC class I molecules; these peptides mainly originate from degradation of cytosolic proteins. Intracellular antigen processing pathways determine which peptides are available for binding to MHC class I molecules and thereby become targets of CTL responses [2].
The steps of the MHC class I antigen processing pathway include proteasomal cleavage of proteins into shorter peptides, translocation of peptides into the endoplasmic reticulum (ER) by TAP, optional ER trimming by aminopeptidases, insertion of peptides into the binding groove of MHC molecules, and transport of peptide/MHC complexes to the cell surface for presentation to CTLs [3]. TAP is a transmembrane protein responsible for the transport of antigenic peptides into the ER. TAP demonstrates peptide binding selectivity, and the affinity of a particular peptide for TAP influences the probability of its presentation by MHC class I molecules. Peptides that are 8-16 amino acids long and have sufficient binding affinity are efficiently translocated by TAP into the ER, while longer peptides may be transported but with lower efficiency [4]. Human TAP (hTAP) is a heterodimer that has two subunits, hTAP1 and hTAP2. TAP belongs to the ATP-binding cassette transporters, and each subunit protein has one transmembrane domain and one ATP-binding domain. The genes for human TAP1 and TAP2 are located in the MHC II locus of chromosome 6 and comprise 10 kb each [5]. A more detailed description of the function, structure, and expression of TAP can be found in [6].
The efficiency of TAP-mediated translocation of a peptide is proportional to its TAP-binding affinity [7,8]. Mutations, such as premature stop codons, or deletions of either hTAP1 or hTAP2 impair peptide transport into the ER and result in a significant reduction of surface expression of peptide/MHC complexes [9]. TAP-deficient cells have low cell-surface HLA class I expression, shown to range from 10% (HLA-A2) to 3% (HLA-B27 and -A3) [10]. The majority of the peptides presented by HLA class I on the cell surface are thus dependent on TAP.
Identification of T-cell epitopes is a highly combinatorial problem. The diversity of human immune responses to T-cell epitopes originates from two sources: high allelic variation of the host (both HLA molecules and T-cell receptors) and high variation of target antigens, particularly those derived from viruses. Computational models are routinely used for pre-screening of potential T-cell epitopes and for minimizing the number of necessary experiments. Most developments have focused on modeling and prediction of peptide binding to MHC molecules (see [11]). Computational models of peptide binding to hTAP that have been developed include binding motifs [7], quantitative matrices [12][13][14], artificial neural networks (ANN) [12,15], and support vector machines (SVM) [16]. Combined computational methods that integrate multiple critical steps (proteasome cleavage, TAP transport, and MHC class I binding) have been proposed as a supporting methodology for prediction of high-probability targets for therapeutic peptides and vaccines [17]. Several combined computational applications of models of antigen processing and presentation have been reported [18][19][20][21][22]. Testing results indicate that these predictions produce a lower incidence of false positives and reduce the number of experiments required for identification of T-cell epitopes. However, these combined predictions need to be interpreted with caution. Alternative pathways for both proteolytic degradation [23] and TAP transport [24] have been reported. In some cases TAP-deficient individuals have normal immune responses [25], suggesting that TAP-independent immune responses are sufficient to provide effective protection from some intracellular pathogens. Nevertheless, the proteasome-TAP-MHC class I pathway is responsible for 90-97% of expression of peptide/MHC class I complexes and is therefore critical for the identification of target epitopes for immunotherapies and vaccines.
We developed PREDTAP, a computational system that predicts peptide binding to hTAP. It uses ANN and hidden Markov models (HMM) as predictive engines. Extensive testing was performed to validate the prediction models and ensure that PREDTAP is both sensitive and specific. PREDTAP is available for public use at [1].

Training dataset
There are 493 nonamer peptides in the training dataset (Table 1) [12,15]. A single duplicate peptide was removed from the data set reported in the original references. The binding scores range from zero to ten. Scores 7-10 denote high peptide/TAP binding affinity, 5-6 moderate binding affinity, 3-4 low binding affinity and scores 0-2 denote non-binding. The dataset is available in the supplementary materials.
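The score bands described above can be expressed as a small lookup. The following is a minimal sketch, not part of the published system; the function name `binding_category` is hypothetical, and the treatment of non-integer scores (for example, 6.5 falling into the moderate band) is an assumption, since the text gives only integer ranges.

```python
def binding_category(score):
    """Map an experimental binding score (0-10) to the affinity band in the text.

    Bands: 7-10 high, 5-6 moderate, 3-4 low, 0-2 non-binder.
    Assumption: non-integer scores are assigned to the band whose lower
    bound they reach, which the source does not specify.
    """
    if not 0 <= score <= 10:
        raise ValueError("binding scores range from zero to ten")
    if score >= 7:
        return "high"
    if score >= 5:
        return "moderate"
    if score >= 3:
        return "low"
    return "non-binder"
```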

Artificial Neural Network
Three-layer backpropagation ANN models (in-house software) were used for the development of the PREDTAP server. The learning method was error backpropagation with a sigmoid activation function. The inputs to the ANN were the binary strings representing nonamer peptides. There are twenty naturally occurring amino acids encoded by the standard genetic code. Each amino acid in a nonamer peptide can be encoded as a binary string of length 20 with a unique position set to "1" and other positions set to "0", resulting in a binary string of length 180 to represent the nonamer. For example, the first two amino acids by alphabetic order, alanine (A) and cysteine (C), are encoded by 10000000000000000000 and 01000000000000000000 respectively, and the last amino acid, tyrosine (Y), is encoded by 00000000000000000001. The outputs were binding scores ranging from zero to ten. The higher the score, the higher the probability of the peptide being a TAP binder. Two ANN architectures were used, 180-2-1 and 180-1-1. The maximum number of ANN training cycles was set to 300. The training was repeated four times, and four sets of weights were obtained. The momentum was 0.5 and the learning rate 0.2. The error threshold for stopping training was 0.01.
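The input encoding can be illustrated as follows. This is a minimal sketch rather than the in-house software; the function name `encode_nonamer` is hypothetical, and the full residue ordering is assumed to be alphabetical by one-letter code, which agrees with the A, C and Y examples given above.

```python
# Assumed residue order: alphabetical by one-letter code (A first, Y last).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_nonamer(peptide):
    """Return the 180-element binary input vector for a 9-residue peptide."""
    if len(peptide) != 9:
        raise ValueError("expected a nonamer (9 residues)")
    bits = []
    for residue in peptide:
        position = AMINO_ACIDS.find(residue)
        if position < 0:
            raise ValueError("unknown amino acid code: " + residue)
        one_hot = [0] * 20          # 20 positions per residue
        one_hot[position] = 1       # exactly one position set to 1
        bits.extend(one_hot)
    return bits
```

Each residue contributes exactly one "1", so a valid vector always has length 180 and sums to 9.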

Hidden Markov Model
HMMs have been applied successfully in prediction of HLA class I-binding peptides [26,27]. An HMM is defined by a finite set of states representing possible states of the modeled system. Some of these states may be directly observable, but some are not, and are denoted as hidden. Biological problems are often sequential, and HMMs frequently utilize sequential ordering of system states. A change (transition) of the system from one state to another is governed by statistical regularities. The probability distribution of the system states can be estimated from the data. In the present study, we used a first-order HMM, in which the current system state is determined only by the preceding state, as described in [26].
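The first-order property, in which each state depends only on the preceding one, can be illustrated with a simplified example. The sketch below is a plain Markov chain over observed residues, not the hidden Markov model of [26]; the function names and the additive pseudocount are assumptions introduced for illustration only.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def fit_first_order(sequences, pseudocount=1.0):
    """Estimate P(next residue | previous residue) from training sequences.

    Assumption: additive smoothing with `pseudocount`, so that unseen
    transitions keep a small non-zero probability.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    transitions = {}
    for prev, row in counts.items():
        total = sum(row.values())
        transitions[prev] = {nxt: c / total for nxt, c in row.items()}
    return transitions


def log_score(seq, transitions):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(transitions[p][n]) for p, n in zip(seq, seq[1:]))
```

Under this simplification, a peptide whose residue-to-residue transitions resemble those in the training set receives a higher log score than one whose transitions were never observed.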

Cross-validation
Cross-validation is a method for error rate estimation. It implements a simple idea: the dataset of n samples is partitioned into two parts, the model parameters are estimated using one part, and the goodness-of-fit criterion is evaluated on the other. Cross-validation thus estimates the goodness-of-fit criterion. Cross-validation tends to overfit when selecting a correct model: it may choose an overly complex model for the given dataset. There is some evidence that, for model selection, multifold cross-validation, where more than one sample is deleted from the training set in each comparison, performs better than simple leave-one-out cross-validation [28]. In our experiments, 10-fold cross-validation was performed to evaluate the performance of the classifiers.
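The partitioning used in 10-fold cross-validation can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used in the study: the function name `k_fold_splits`, the random shuffling, and the fixed seed are all hypothetical choices.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Return k (train_indices, test_indices) pairs covering all samples.

    Assumption: samples are shuffled once and dealt into k folds of
    near-equal size; each fold serves as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

For a dataset of 493 peptides this yields ten test folds of 49 or 50 peptides each, with every peptide tested exactly once.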

Prediction performance measurement
The predictive performance of the models was evaluated by sensitivity (SE) and specificity (SP) measures. Sensitivity, SE = TP/(TP+FN), indicates the percentage of correctly predicted binders, where TP stands for the number of true positive predictions (experimental binder predicted as binder) and FN stands for the number of false negative predictions (experimental binder predicted as non-binder). Specificity, SP = TN/(TN+FP), indicates the percentage of correctly predicted non-binders, where TN stands for the number of true negative predictions (experimental non-binder predicted as non-binder) and FP stands for the number of false positive predictions (experimental non-binder predicted as binder). For the studied problem, we consider values of SP > 0.8 useful in practice.
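The two measures can be computed directly from paired experimental and predicted labels. The following is a minimal sketch; the function name `sensitivity_specificity` and the use of boolean labels are assumptions made for illustration.

```python
def sensitivity_specificity(actual, predicted):
    """Return (SE, SP) for paired boolean labels (True = binder)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # binder predicted as binder
    fn = sum(1 for a, p in pairs if a and not p)      # binder predicted as non-binder
    tn = sum(1 for a, p in pairs if not a and not p)  # non-binder predicted as non-binder
    fp = sum(1 for a, p in pairs if not a and p)      # non-binder predicted as binder
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one binder and one non-binder")
    return tp / (tp + fn), tn / (tn + fp)
```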
The receiver operating characteristic (ROC) curve analysis provided a measure of the overall prediction accuracy of the models [29]. The ROC curve is generated by plotting SE against (1-SP) for various classification thresholds. As a rough guide, an area under the ROC curve (Aroc) of 1.0 represents a perfect prediction, values of 0.9 to 1.0 represent excellent accuracy, 0.8 to 0.9 good accuracy, 0.7 to 0.8 marginal accuracy, and 0.5 to 0.7 poor accuracy, while a value of 0.5 indicates random choice [29].
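The area under the ROC curve can be computed without plotting, because it equals the probability that a randomly chosen binder receives a higher score than a randomly chosen non-binder, with ties counted as one half. The sketch below uses this pairwise formulation; it is not necessarily how Aroc was computed in the study, and the function name `area_under_roc` is hypothetical.

```python
def area_under_roc(positive_scores, negative_scores):
    """Aroc as the fraction of (binder, non-binder) pairs ranked correctly."""
    if not positive_scores or not negative_scores:
        raise ValueError("both classes must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0      # binder ranked above non-binder
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positive_scores) * len(negative_scores))
```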

Normalization of prediction scores
Brusic et al. [15] showed that ANN models were skewed, with a tendency to center-shift predictions for both very low and very high TAP binders. To obtain prediction scores evenly distributed in the range 0-10, we implemented prediction score normalization. The raw prediction scores produced by HMM methods are not within the range 0-10, so score mapping is also necessary to bring the final prediction scores within that range. The mapping of scores was done according to the equation:
$$\mathrm{score}_n = \frac{\mathrm{score} - \mathrm{score}_{\min}}{\mathrm{score}_{\max} - \mathrm{score}_{\min}} \times 10$$
where $\mathrm{score}_n$ denotes the normalized score, $\mathrm{score}$ denotes the raw prediction score, and $\mathrm{score}_{\min}$ and $\mathrm{score}_{\max}$ denote the minimum and maximum values of the raw scores. The values for $\mathrm{score}_{\min}$ and $\mathrm{score}_{\max}$ were obtained using extensive simulation. More than 5000 randomly selected nonamer peptides were used for prediction using the ANN/HMM models. Since the testing data contain a large number of nonamer peptides, the highest and lowest predicted scores from the testing data were taken as reasonable maximum and minimum scores for normalization.
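The score mapping is a linear rescaling and can be sketched as follows. The function name `normalize_score` is hypothetical, and the optional clamping of scores that fall outside the simulated minimum and maximum is an assumption; the text does not state how such scores are handled.

```python
def normalize_score(raw, score_min, score_max, clamp=True):
    """Map a raw prediction score linearly onto the range 0-10.

    Assumption: with clamp=True, raw scores outside [score_min, score_max]
    are clipped to 0 or 10; the source does not describe this case.
    """
    if score_max <= score_min:
        raise ValueError("score_max must exceed score_min")
    scaled = (raw - score_min) / (score_max - score_min) * 10
    if clamp:
        scaled = max(0.0, min(10.0, scaled))
    return scaled
```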

Implementation
The web interface of PREDTAP uses a set of graphical user interface forms. The interface was built using a combination of Perl, CGI and C programs. PREDTAP has been implemented in the SunOS 5.9 UNIX environment.

Model validation
Assessment of predictive accuracy was carried out for three subsets of peptide binders: 1) all binders, including low, moderate and high binders, were considered as positive samples and all non-binders as negative samples (referred to as the LMH set); 2) moderate and high binders were considered as positive samples, and all non-binders and low binders as negative samples (referred to as the MH set); and 3) only high binders were considered as positive samples, and all non-binders, low binders and moderate binders as negative samples (referred to as the H set).
Sensitivities and specificities of the ANN and HMM models at various thresholds (based on normalized scores) in 10-fold cross-validation experiments are shown in Figures 1 and 2. We selected the normalized score of 6.0 as a reasonable selection threshold, with peptides with scores ≥ 6.0 predicted as TAP binders. In Table 3, the sensitivities and specificities of the ANN and HMM models at the selection threshold 6.0 are shown. The ANN model correctly predicted 88% of high binders at the cost of 11% false positives (the 11% also includes moderate and low-affinity binders), 67% of moderate and high binders with 3% false positives in the MH set, and 50% of all binders (low, moderate and high) with practically no false positives (Table 3A). The specificities of the ANN model for all three sets (LMH, MH and H) are high (1.00, 0.97 and 0.89 respectively), which indicates that 6.0 is a stringent selection threshold and that the false positive rate is very low at this threshold. At threshold 6.0, the HMM model correctly predicted 91% of high binders with 32% false positives, 81% of moderate and high binders with 19% false positives, and 66% of all binders (low, moderate and high) with 14% false positives (Table 3B). The specificity of the HMM model was 0.86 for the LMH set, higher than the 0.81 for the MH set, which in turn was much higher than the 0.68 for the H set. This implies that the HMM model was able to select binders (low, moderate and high) with a low false positive rate, but failed to categorize them into subgroups (low, moderate or high binders).
Figure: Plot of sensitivity and specificity of the ANN model against thresholds in 10-fold cross-validation.
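The three evaluation sets differ only in where the line between positive and negative samples is drawn on the experimental score scale. The sketch below makes this explicit; the names are hypothetical, and the cutoffs are derived from the score bands of the training dataset (low from 3, moderate from 5, high from 7) under the assumption that experimental scores are compared against those lower bounds.

```python
# Lowest experimental score counted as a positive sample in each set
# (assumed from the score bands: low 3-4, moderate 5-6, high 7-10).
POSITIVE_CUTOFF = {"LMH": 3, "MH": 5, "H": 7}


def is_positive_sample(experimental_score, subset):
    """True if the peptide is a positive sample in the given set."""
    return experimental_score >= POSITIVE_CUTOFF[subset]


def predicted_binder(normalized_score, threshold=6.0):
    """Apply the selection threshold: scores >= 6.0 are predicted binders."""
    return normalized_score >= threshold
```

A low binder with an experimental score of 4, for example, is a positive sample in the LMH set but a negative sample in the MH and H sets, so a prediction above the threshold for that peptide counts as a true positive in one evaluation and a false positive in the other two.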
To evaluate the predictive power of the methods, the dataset was partitioned into a training set containing two thirds of the data points, randomly selected, and a testing set containing the remaining one third. The tests were conducted three times for each of the ANN and HMM methods. The Aroc values of the ANN and HMM models are shown in Table 4. Despite the smaller training datasets, the ANN models continued to show excellent performance, with Aroc values above 0.9 for the H and MH sets, and good performance, with Aroc values above 0.85, for the LMH set. The performance of the HMM models dropped slightly but remained good, with Aroc values above 0.85 for the H and MH sets and above 0.80 for the LMH set.

Comparison to other predictive systems
Since PREDTAP, TAPPred and SVMTAP were built using the same set of training data [12,15], independent data sets would be needed to test and compare their prediction performance. Instead, we compared their predictions on the human papillomavirus type 16 E6 and E7 proteins; the amino acid positions of the top 5% predicted TAP binders are shown in Tables 5 and 6. Half of the experimental HLA-A3 binders overlapped predicted TAP binders. As suggested by previous studies [15,32], HLA-A3 binding peptides have high affinity to TAP, in agreement with our results. SVMTAP, TAPPred (SVM), and PREDTAP (ANN & HMM) predicted similar sets of TAP-binding peptides, while the TAPPred (cascade SVM) predictions were different (Table 5). A single HLA-A3 binder from the E7 protein did not overlap any of the predicted TAP binders except for those of TAPPred (cascade SVM) (Table 6). Again, TAPPred (cascade SVM) predicted a completely different set of peptides compared to the other four predictors.
Three naturally processed peptides from the tumor antigen KM-HN-1, namely 196-204, 499-508, and 770-778, are presented by HLA-A24 [31]. HLA-A24 binding peptides have been reported as TAP efficient [15,32]. The KM-HN-1 protein is 833 amino acids long, and we used the top 3% of the predictions (Table 7). Peptide 195-203, which has an 8-amino-acid overlap with KM-HN-1 196-204, was selected by SVMTAP, TAPPred (SVM) and PREDTAP (ANN & HMM), but not by TAPPred (cascade SVM). Peptide 499-508 was selected by four of the methods as a potential 16-mer, and also as a 12-mer by PREDTAP (ANN), but was not selected by TAPPred (cascade SVM). It was shown that some peptides are efficiently transported by TAP in their optimal size for MHC class I binding, while some peptides are transported as larger peptides that need further trimming in the ER for MHC class I binding [33]. It is likely that peptides 196-204, 499-508, and 770-778 are transported to the ER in a longer form and then further trimmed for loading onto the HLA-A24 molecules.

Using PREDTAP
To perform predictions using PREDTAP, the user needs to paste a protein sequence into the textbox and assign a name to the sequence. The sequence must contain between nine and 2000 amino acids. If the prediction is run with an input sequence containing symbols other than the 20 amino acid codes (spaces and carriage returns are allowed), or if the total sequence length is outside the 9-2000 amino acid range, an error message will be displayed and predictions will not be produced. The input can be either a contiguous protein sequence (an amino acid sequence, or FASTA format) or a list of peptides, one per line. The default selection on the webpage is "Protein sequence" (Figure 3A), which means the input sequence is treated as a contiguous protein sequence (carriage returns and line breaks will be ignored). The PREDTAP input processing program decomposes the protein sequence (or the list of peptides) into a series of 9-mer peptides overlapping by eight amino acids. Individual 9-mer peptides are then submitted for prediction. Predicted binding scores for all 9-mers are displayed in the result tables (Figure 3B). The 9-mer binding scores are within the range 0-10; the higher the score, the higher the probability of the peptide being a binder. PREDTAP has an option for plotting the binding scores of all the overlapping 9-mer peptides as a graph, in which the X axis represents the start position of a 9-mer peptide and the Y axis represents its binding score. The user can sort the peptides by their binding scores and choose to view only predicted binders with binding scores above a certain threshold (Figure 3C).
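The input checks and the decomposition into overlapping 9-mers can be sketched as follows. This illustrates the described behaviour and is not the server's Perl/C code; the function names are hypothetical, and skipping lines that begin with ">" as FASTA headers is an assumption about how FASTA input is handled.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(text):
    """Validate pasted input and return one contiguous sequence.

    Assumption: lines starting with '>' are FASTA headers and are dropped.
    Whitespace and line breaks are ignored, as described in the text.
    """
    lines = [ln for ln in text.splitlines() if not ln.startswith(">")]
    sequence = "".join("".join(lines).split())
    invalid = set(sequence) - AMINO_ACIDS
    if invalid:
        raise ValueError("invalid symbols: " + "".join(sorted(invalid)))
    if not 9 <= len(sequence) <= 2000:
        raise ValueError("sequence length must be between 9 and 2000")
    return sequence


def overlapping_nonamers(sequence):
    """Return (1-based start position, 9-mer) pairs overlapping by eight residues."""
    return [(start + 1, sequence[start:start + 9])
            for start in range(len(sequence) - 8)]
```

A sequence of length n yields n - 8 nonamers, so the shortest accepted input (nine residues) produces exactly one prediction.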
When users select the input sequence type to be "a list of peptide sequences", the input sequences separated by carriage returns or line breaks are treated as different peptides (Figure 4A). All overlapping 9-mers in each peptide are submitted for prediction. In the result tables, the predicted binding score of each input peptide is represented by the highest individual 9-mer binding score within that peptide. The 9-mer with the highest binding score in each peptide is displayed as "Binding Core" in the result table. The user can sort the peptides by their binding scores (Figure 4B).
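The peptide-list mode can be sketched as a maximum over the 9-mers of each input peptide. In the sketch below, `score_fn` is a hypothetical stand-in for the ANN or HMM predictor, the function names are assumptions, and the tie-breaking rule (the first highest-scoring 9-mer is reported) is also an assumption not stated in the text.

```python
def binding_core(peptide, score_fn):
    """Return (best 9-mer, its score) for one input peptide.

    `score_fn` is a placeholder for a predictor that maps a 9-mer to a score.
    Assumption: on ties, the first highest-scoring 9-mer is reported.
    """
    if len(peptide) < 9:
        raise ValueError("peptide must contain at least nine residues")
    nonamers = [peptide[i:i + 9] for i in range(len(peptide) - 8)]
    best = max(nonamers, key=score_fn)
    return best, score_fn(best)


def score_peptide_list(text, score_fn):
    """Score each non-empty line as a separate peptide, sorted by score."""
    results = []
    for line in text.splitlines():
        peptide = line.strip()
        if peptide:
            core, score = binding_core(peptide, score_fn)
            results.append((peptide, core, score))
    return sorted(results, key=lambda row: row[2], reverse=True)
```

Each result row holds the input peptide, its binding core, and the score assigned to the peptide, mirroring the columns described for the result table.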

Discussion
We have earlier compared four prediction servers for prediction of H-2Kd binding peptides [34]. A 121-amino acid long sequence of the nuclear export protein NS2 from
The combinatorial properties of molecular mechanisms involved in antigen processing and the adaptive learning nature of immune responses limit our ability to fully predict immune responses. Combining experimental and computational techniques improves our ability to decipher complex interactions of the immune system. Computer models are used to complement laboratory experiments and thereby speed up knowledge discovery in immunology. In particular, the number of large-scale laboratory experiments for T-cell epitope mapping can be minimised by the judicious use of experiments aimed at developing and validating computer models. These models can then be used to perform large-scale computer simulations rapidly and inexpensively. The hypotheses generated from these simulations can then be retested in the laboratory to confirm their applicability to real-life immunology. Further work will include both the refinement of computational models and the scanning of disease-related antigens for peptide sequences that show a high probability of processing and presentation. Those peptides that are most likely to be produced by proteasomal cleavage, transported by TAP, and bound by HLA class I molecules are likely to be promising candidates for peptide-based CTL vaccines. The PREDTAP server provides for the prediction of peptide binding by TAP and can be used as a comparison method against other TAP-prediction servers.
Figure 3 Examples of the output pages of PREDTAP for a single protein. The sequence type chosen is "protein sequence". A) The input page. B) The main result page. The input sequence is decomposed into overlapping 9-mers for prediction of binding scores to TAP. C) Alignment view of the predicted TAP binding regions in the input protein.