Diagnostics for Statistical Variable Selection Methods for Prediction of Peptic Ulcer Disease in Helicobacter pylori Infection

Background The development of accurate classification models depends upon the methods used to identify the most relevant variables. The aim of this article is to evaluate variable selection methods to identify important variables in predicting a binary response using nonlinear statistical models. Our goals in model selection include producing non-overfitting stable models that are interpretable, that generate accurate predictions and have minimum bias. This work was motivated by data on clinical and laboratory features of Helicobacter pylori infections obtained from 60 individuals enrolled in a prospective observational study. Results We carried out a comprehensive performance comparison of several nonlinear classification models over the H. pylori data set. We compared variable selection results by Multivariate Adaptive Regression Splines (MARS), Logistic Regression with regularization, Generalized Additive Models (GAMs) and Bayesian Variable Selection in GAMs. We found that the MARS model approach has the highest predictive power because the nonlinearity assumptions of candidate predictors are strongly satisfied, a finding demonstrated via deviance chi-square testing procedures in GAMs. Conclusions Our results suggest that the physiological free amino acids citrulline, histidine, lysine and arginine are the major features for predicting H. pylori peptic ulcer disease on the basis of amino acid profiling.


Introduction
The analysis of high-dimensional data, where the number of predictors exceeds the sample size, poses many challenges for statisticians and calls for new statistical methodologies in order to select relevant variables in multivariate data, feature selection is used to overcome the curse of dimensionality by removing non-essential variables to achieve a model with predictive accuracy. Consequently, the choice of a variable selection procedure becomes very important for enhancing the ability to generate reproducible findings and generalizable conclusions. In high-dimensional data it is desirable to have parsimonious or sparse representations of prediction models. Since highly complex models are penalized by increased total error, regularization helps reduce complexity in classification by minimizing over-fitting of the training data. We evaluated this by maximizing goodness-of-fit and simultaneously minimizing the number of variables selected.
In this study, we evaluated different models by randomly selecting and withholding the training data to be used later for testing. The area under the receiver characteristic operating curve (ROC) was used as a measure for comparing prediction accuracy based on sensitivity and specificity for both training and test data. In this study discriminative features were identified that associated with H. pylori peptic ulcer disease. We found that various free amino acid measurements could be associated with disease outcome. However, many of these variables are highly correlated and which of the factors will result in the most stable classifier is unknown. Here we sought to extend this work by comparing the effects of various feature reduction methods.

Study design
Sera were obtained from patients with documented H. Pylori infection undergoing an endoscopic exam for routine medical management. Thirty subjects with proven duodenal ulcer were matched using sera collected during the same time period. Samples were collected after receiving written informed consent as approved by the Institutional Review Board of Baylor College of Medicine, Houston, TX and the study was conducted in accordance with the guidelines of the Helsinki Declaration.

Amino acid analysis
Serum was precipitated by mixing equal volumes of 7.5% sulfosalicylic acid (SSA) in 0.02N hydrochloric acid (HCl). Aminoelthyl cysteine (AEC) was added as the internal standard in each sample. The precipitate was centrifuged for 15 min at 10,000 × g, and supernatant containing physiological amino acids was saved. The supernatant was quantified for amino acids using a Hitachi L8800 Amino Acid Analyzer. Data was reported as nmol amino acid per 10 μl plasma. Samples were run in duplicate and concentrations varied less than 10% between duplicates. For measurements below the limits of detection, samples were imputed with 1/10 the lower limit of detection for the assay.

Feature reduction-modeling strategies
In this paper we use the term feature selection methods to refer to identifying the subset of differentially-expressed predictors that are useful and relevant in distinguishing different classes of samples. Similarly, model selection is a process of seeking the model from a set of candidate models that are the best balance between model fit and complexity. Our research goal is to evaluate the various model fittings of increasing data complexity and to find the best models to identify the underlying model by both AIC, BIC, and cross-validation.

Feature reduction using Significance Analysis of Microarray (SAM)
SAM determines significance by using more robust test statistics and permutations to estimate false discovery rate instead of the conventional "t" distribution level of significance [1]. Efron et al. [2] developed an empirical bayesian approach using non-informative priors and deriving the posterior probability difference for each of the predictors without having to run t-tests or Wilcoxon tests to identify those that were differentially expressed. In some cases, a heuristic approach was investigated for feature selection by integrating correlation, histogram, and other statistical measures. We used the information criterion on the modeling.
Akaike information criterion (AIC) is given by (1) where L(M) is the maximum likelihood function of the parameters in the model M, and p is the number of covariates estimated in the model [3].
Bayesian information criterion (BIC) is given by (2) where n is the sample size, M and p are defined as those variables shown in Equation 1 [4,5]. Specifically, Stone [6] showed that the AIC and leave-one out cross validation are asymptotically equivalent.

Logistic regression with regularization
Logistic modeling has a binary response y i ∈ {0, 1}, and assumes that (3) Regularized and shrinkage estimation methods such as a LASSO (least absolute shrinkage and selection operator) estimator helps address variable selection and multicollinearity. For a binary response variable and the logistic regression models, the LASSO estimator is estimated by penalizing the negative log-likelihood with the L 1 -norm. The penalty term is chosen by a cross-validation technique to evaluate the out-of-sample negative log-likelihood [7]. The EN (Elastic Net) penalty is designed to simultaneously select strongly correlated variables that combine the L 1 and L 2 penalizing terms in the model [8].
The coefficient vector β that minimizes the penalized log-likelihood is (4) where p i = Pr(y = 1|x). Fan and Li (2001) proposed the Smoothly Clipped Absolute Deviation (SCAD) penalty, which compromises between L 1 and L 2 , and the L 0 selection methods. The SCAD penalty deletes small coefficients and keeps large coefficients unshrunken, but sacrifices continuity and stability. The SCAD penalty can provide a smaller bias in coefficient estimation than LASSO because it is bounded as a function of β. The SCAD penalized estimator also has an oracle property [9]. Sparse regression using penalization is one of the most popular tools for analyzing high dimensional data.

Generalized Additive Models (GAMs) for classification problems
GAMs provide a general frame work moving beyond linearity by allowing nonlinear functions of each of the variables, while maintaining the additive assumption [10]. Logistic regression GAM modeling has a binary response y i ∈ {0, 1}, and assumes (5) GAM modeling allowed us to fit a nonlinear function to each predictor, so that we could automatically model the nonlinear relationship that standard linear regression will miss. The nonlinear fits can potentially make accurate predictions for the outcome.

Multivariate Adaptive Regression Splines (MARS)
The MARS method of Friedman [11] is a nonparametric regression method that estimates complex nonlinear relationship by a series of truncated spline functions of the predictors [12]. The basis functions are combined in the MARS model as a weighted sum of (6) where ŷ is the response described by the model, a 0 the coefficient of the constant basis function (intercept), p the total number of basis functions and a k , the coefficient of the k th basis function B k (x). MARS models use hockey stick basis functions of the form (x − t) + and (t − x) + , with t being the knot. The basis functions in MARS are single truncated spline functions or a product of two or more spline functions for different predictors. The first order MARS model was built without interactions to over-fit the training data. A maximum number of basis functions equal to 30 was used as the stopping criterion. The model was pruned using a ten-fold generalized cross validation. The optimal model was selected based on evaluation of the model complexity and its predictive quantities for the test sets. Software implementation of the MARS model is available in Salford Predictive Modeler version 7.0 from Salford Systems.

Bayesian variable selection for GAM
Bayesian variable selection is an approach designed to assess the robustness of results, in terms of alternatives, by calculating posterior distribution over coefficients and models. One of the most popular approaches is to assume a spike-and-slab mixture prior for each coefficient, with one component being a narrow spike around the origin that imposes very strong shrinkage on the coefficients and the other component being a wide slab that imposes very little shrinkage on the coefficients [13]. The posterior weights for the spike and the slab can then be interpreted analogously. To select the models of predictors between smoothing nonlinear terms and linear effects, we performed Bayesian variable selection in GAMs implemented in the R package spikeSlabGAM [14]. Bayesian GAMs produce a posterior probability for each possible model in addition to one for each predictor. Using Bayesian GAMs, model uncertainty can be incorporated into conclusions about parameters and predictions. Thus, we have to consider all possible models that fit. Bayesian GAMs can be applied using the R library BMA (http://cran.r-project.org).

Nonlinear testing procedures
We assessed the linear or non-linear association of binary response variables of selected variables in each model. Investigation of model predictors and their linear association was determined using GAMs. We evaluated the partial residual plot as a diagnostic graphical tool for identifying nonlinear relationship between the response and covariates for generalized additive models [10,12,15]. For each part predictor, we also examined the loglikelihood ratio test p-values, comparing the deviance between the full model and the model without that variable. We calculated the projection (hat) matrix, Cook's distance, various residuals and the estimated probabilities versus each predictor to evaluate outliers and identify influential points in the models. We used both the change in residual deviance (as in parametric or nonparametric models), and the ROC to compare the performances of the statistical models.

Descriptive statistics
Sera from 30 subjects with H pylori infection, without endoscopy-documented mucosal ulceration, and 30 subjects with H pylori infection and peptic ulceration were studied. Concentrations of the free amino acids were measured in each subject. The concentration of each amino acid by peptic ulcer disease (PUD) status is shown in Table 1. The concentrations of taurine (0.15 ± 0.04 no PUD vs 0.19 ± 0.08 with PUD, p<0.05), urea (42.29 ± 11.41 no PUD vs 54.17 ± 22.42 PUD, p<0.05), glycine (3.97 ± 0.64 no PUD vs 4.64 ± 1.09 with PUD, p<0.05), citrulline (0.34 ± 0.1 no PUD vs. 0.54 ± 0.14 with PUD, P<0.001) were significantly different. Of note, all amino acids were elevated in the subjects with PUD, indicating upregulation of the urea cycle.

Parametric and nonparametric modeling
Our objective was to create a serum biomarker panel of amino acids that predict the occurrence of PUD. Table 2 shows the sparse regression coefficients with the LASSO, EN, and SCAD penalty which contain important variables and the model selected by using a BIC criterion. Because the underlying data structure dictates the selection of an appropriate modeling approach, we analyzed the contributions of parametric (linear) or nonparametric (spline) features using Bayesian variable selection. This method produces a hierarchy of structured model selections for parametric and nonparametric relationships to the PUD outcome for each feature. The posterior probabilities for the linear and spline components are shown in Table 3. From this analysis, the linear and spline component of citrulline and the spline component of histidine were significant ( Table 4).
As an additional analysis, we examined the relationships of amino acids to outcome using GAMs. Inspection of the GAM plots indicates that the partial residuals are nonlinear ( Figure  1). For example the partial residual plot of citrulline shows a linear component at low citrulline concentrations until a concentration of 0.5 is reached, at which time the curve sharply inflects to a horizontal line. A similar inflection is seen in other variables. From this analysis, we concluded that the modeling of PUD requires a nonparametric approach.
For the nonparametric modeling we applied MARS, an additive modeling technique that uses piecewise linear spline functions (basis functions) as predictors. MARS uses a twostage process for constructing the optimal classification model. The first half of the process involves addition of basis functions until a user-specified number of basis functions have been reached. In the second stage, MARS deletes basis functions in order, starting with the basis function that contributes the least to the model until an optimum model is reached.

Model performance
The optimal MARS identified four informative amino acids (citrulline, histidine, lysine and arginine). Evaluation of the model performance is evaluated in several ways. The accuracy of prediction was evaluated using a confusion matrix. The model produced an overall accuracy of 91.67%, with a 96.67% ability to correctly identify PUD ( Table 5).
The second analysis involved evaluation of the ROC, where sensitivity vs. 1-specificity was plotted. In the ROC analysis, a diagonal line (45 degree slope) starting at zero indicates that the output was a random guess, whereas an ideal classifier with a high true positive rate and low false positive rate would curve positively and strongly towards the upper left quadrant of the plot. The AUC is equivalent to the probability that two cases, one chosen at random from each group, are correctly ordered by the classifier. The AUC of the MARS model predictor was 0.9656 (Figure 2), suggesting that the model performed in a highly sensitive and specific manner.
The relative contribution of each amino acid to the overall performance of the classifier was evaluated by the variable importance, a relative measure of the effect of removing a feature on the model accuracy. Here, citrulline was the most important variable (variable importance of 100%), histidine and lysine were less important but similar (31.5% and 27.3%, respectively) and arginine was least important (11.6%, Figure 3).
Finally, the basis functions (BFs), which are combinations of independent variables in the model, are shown in Table 6. Importantly, we note no interaction terms, minimizing the potential for the model to have over-fitted the data. The addition of two BFs for citriulline corresponds well to the inflection of citrulline in the GAM analysis (cf. Figure 1).
The distributions of concentrations of the amino acids by disease classification are shown in the box plots of Figure 4. A nonparametric relationship is seen for each.
These data indicate that physiological concentrations of amino acids are perturbed by H pylori induced PUD, and combinations of citrulline, histidine, lysine and arginine can be used to predict PUD using nonparametric modeling. The residuals for modeling GAM fitting provide information for modeling checking in the GAM check plot of Figure 5.

Conclusion
We evaluated parametric and non-parametric modeling approaches (GAMs, Bayesian GAMs and MARS) for discovering specific physiological free amino acids as biomarkers for H. pylori-associated peptic ulcer disease. These studies may have potential benefits by providing non-invasive identification of individuals at risk for clinically significant ulceration and for institution of appropriate targeted therapy. This study also suggests hostinteraction pathways (amino acids) related to the pathogenesis of peptic ulcer in H. pylori infected patients. Interestingly, in Crohn's disease, a gastrointestinal inflammatory disease, serum citrulline was found to be inversely correlated with each other [16]. The presence of amino acids that correspond to the urea cycle could reflect the presence of a urea cycle and a highly active urease in H. pylori. However, differences the urea cycle amino acids in individuals who are infected but do not have PUD versus those that have PUD are not yet established, but could reflect differences in infected individuals, the infecting strains or both. We recognize that our study is limited because of the relatively small data set. Further evaluation of this modeling procedure on a large independent data set is needed.  ROC analysis. Shown is a Receiver Operating Characteristic (ROC) curve for the predictive model for peptic ulcer disease. Y axis, Sensitivity; X axis, 1-Specificity.  For each variable shown, the distribution of each predictor is divided over case (with ulcer) and control (without ulcer). PUD: endoscopy-documented peptic ulcer.  Sparse penalized logistic regression coefficients.