BREAKING THE CODE: Predicting Where Disease Will Strike

Health authorities today enjoy an embarrassment of riches when it comes to data on disease incidence, thanks to the rapidly increasing use of electronic record keeping at hospitals, clinics, and related businesses such as pharmacies. Where significant numbers of paper files once required days or weeks to process for survey purposes, much larger volumes of material on patient conditions can today be assembled in a matter of hours. This growing abundance of information has raised expectations about the prospect of identifying local outbreaks of disease at ever earlier stages so that such outbreaks may be addressed and contained as soon as possible—expectations that are perhaps more easily raised than met, given statistical realities. 
 
One of the most recent strategies for harnessing these data is offering some promising results, however. A paper that appeared in the March 2005 edition of PLoS Medicine outlines a new methodology that requires only that researchers know the actual number of disease cases occurring in a given area over a given period of time. The resulting statistical model makes only the most limited number of assumptions surrounding the emergence of a disease, while attempting to compensate for naturally occurring temporal and geographical variations in disease reporting. 
 
This offers greater detail than one-dimensional statistical methods, which track disease outbreaks in either purely temporal or purely geographic terms. Above all, the new model eliminates the need for information about the local population and its relative risk for disease—for example, whether a neighborhood contains a higher-than-average proportion of groups, such as infants or the elderly, who may be prone to specific ailments.


Background
In this work, we analyzed the slope of the cholesterol increase with age in the simulated data (Problem 2). Our objective was to identify the markers that are linked to the disease genes related to a high rate of increase in cholesterol. Genetic Analysis Workshop 13 provided information that the disease genes are located on chromosomes 7(s7), 15(s8), and 21(s9), respectively, and that the gene on chromosome 21(s9) only affects cholesterol rate in the females, i.e., it interacts with gender. The Haseman-Elston [1] method allowed one to apply linear regression meth-ods for linkage analysis. For each sibling pair, it used the number of alleles identical by descent (IBD) at each marker as the explanatory variables and a statistic measuring similarity of values of the quantitative traits in the sibling pair as the response variable. The original Haseman-Elston method [1] used the squared difference between the traits of the siblings. In a recent publication, Elston et al. [1] proposed the cross-product of the two trait values in a sib pair as the response, which was used in this paper. Suh et al. [2] applied Stochastic Search Variable Selection (SSVS), a Bayesian variable selection method proposed by George and McCulloch [3] for the linear regression model, to the Haseman-Elston method. Although the scope of Suh et al was very preliminary, with only the IBD values at the linked markers plus 10 unlinked markers used as candidate explanatory variables in the variable selection, it showed the Bayesian variable selection approach to be very promising. The study presented here extended these methods in two respects. First, we took advantage of SSVS by including all 399 markers as candidate explanatory variables. It is computationally impossible to consider all subsets of 399 markers using a traditional frequentist approach. Secondly, a hierarchical prior probability structure as discussed by Chipman [4] was imposed on the model space to study the interaction effects (epistasis). The results were reported and compared with those obtained with the more traditional forward and backward step-wise regression.

Haseman-Elston method
We chose to analyze the rate of change in cholesterol over time in the simulated data. First, for each individual, we obtained the least square (LS) estimate for the slope of cholesterol over the time. For the i th sibling pair, using the LS estimate of slope as the trait (Y 1i , Y 2i ), we computed their cross-product CP i = (Y 1i -m)(Y 2i -m) as our response values, where m is the mean of the slopes over all siblings in the same family. Elston et al. [1] introduced the crossproduct CP, as the replacement of the squared difference . In our regression analysis, we adopted CP as the response, and also used squared-difference for comparison. For simplicity, we assumed the errors to be independent but a correlation structure could be implemented into our method in a straightforward way.
There are about 1500 full sib pairs and a few half sib pairs in each replicate. In the replicate we considered there are 1522 full sib pairs. The number of alleles shared in each pair was obtained for each sib pair at each marker using the SIBPAL program of the SAGE software [5]. There were a total of 399 markers. We had where the ε ~ N(0, σ 2 ) were assumed to be independent and X values were IBD scores.
To study the effect of gender, we also included the genders of the siblings as an explanatory variable. It was in fact coded as two dummy explanatory variables as follows: (male, male) = (0, 0), (male, female) = (0,1), and (female, female) = (1,1). [3] proposed a Bayesian model selection method for variable selection based on the Gibbs sampler. The criterion of interest was taken to be the posterior probability of a model conditional on the data that could be obtained using the stochastic search variable-selection. For the simplest case of linear regression with normal errors:

George and McCulloch
where β may contain main effects or interactions effects. They set the prior distribution of β as mixtures of two normal distributions by introducing the latent variable γ: where much larger variance (c > 1) allowed for γ k = 1 to have a large influence. A recommended choice for these parameter values is given by George and McCulloch [3]. The value of c was set equal to 10 in our analysis. A model was represented by a vector γ = (γ 1 , γ 2 ,..., γ p ), where γ k = 0 or 1. If γ k = 0, then the marker X j was considered to be excluded from the model and if γ k = 1, it was considered to be included in the model. Note that β 0 was taken to be always included, thus we could set β 0 N((0, c 2 τ 2 ). With appropriate prior on γ = (γ 1 , γ 2 ,..., γ p ) and σ 2 , we obtained a posterior distribution of γ using Gibbs sampling. Therefore, by examining the posterior probability of γ, we identified the optimal model with the largest posterior probability and rank the markers using the marginal distribution of each γ k . A prior for γ corresponds to a prior on the model. The commonly used independence prior implies that the importance of any variable is independent of any other variable. In other words, under this prior, each X i enters the model independently of the other coefficients, with probability p(γ i = 1) = 1 -p(γ i = 0) = p i . A smaller p i can be used to downweight X i values that are costly or of less interest. For our case, a useful reduction was to set p i = p, in which p is the a priori expected proportion of X i values in the model. When only main effects but no interaction were considered, the importance of any variable was independent of the importance of any other variable. Thus the independence prior implied that the prior of γ was simply set as prob(γ) = p n , where n is the number of ones in γ. Increased weight on parsimonious models could instead be obtained by setting p small. So in our case, p was set to be small, 0.02 first, and next to see how our method is robust to this choice of p, we chose a new value of p = 0.002 for comparison. The details on the MCMC algorithms can be found in George and McCulloch [3].
We applied SSVS to select markers linked to cholesterol rate from all 399 markers under consideration. Since it is  Top ranked markers based upon SSVS Figure 1 Top ranked markers based upon SSVS The markers are ranked according to their marginal posterior obtained using 10,000 cycles of Gibbs algorithm. Disease loci are located on chromosomes 7, 15, and 21, and gender effect is ranked at the 15 th (Replicate 1 of the simulated data).
We followed the Markov chain Monte Carlo (MCMC) algorithms described in Chipman [4] and implemented it using the JAVA programming language. The programs were run on a Linux cluster using Intel processors. The length of the MCMC chain was set to 10,000. The running time was approximately 30 minutes on a single 1.0 GHz CPU under the above specified environment. The first 1000 samples were used as the burn-in period and not included in estimating the posterior.

Hierarchical prior structure
When interaction effects (epistasis) are considered in the model selection, the model space becomes enormous and the common independence prior for γ is not appropriate anymore. With interactions, the prior for gamma can cap-ture the dependence relation between the importance of a higher order term and those lower order terms from which it was formed. Chipman [4] proposed a hierarchical prior structure for this model space. The importance of the interactions such as X i X j will depend only on whether the main effects X i and X j are included in the model. This belief can be expressed by a prior for of the form γ γ γ γ = ( , , ) The probability that the term X i X j is active may take on four different values, depending on the values of the pair .
In our analysis, we set (p 00 , p 01 , p 10 , p 11 ) = (0, 0, 0, p). This corresponded to the prior belief that if the interaction effect between two factors exists in a model, the main effects of the two factors must be included in the same model.
Our study was conducted in two stages. At the first stage, all 399 candidate markers and gender were the candidate variables in SSVS, but interactions were not considered. At the second stage, SSVS was applied to the same sib-pair responses with the top 30 candidate variables selected from the first stage and their interactions as the candidate variables. Among the third were the gender and 29 markers. This brought the total number of candidate variables in SSVS to 465. We chose only the top 30 variables from the first stage for two reasons. First, it is reasonable to assume that only a few linked loci exist and they should be contained in the top 30. Second, this is the maximum size that the current SSVS algorithm handles comfortably in the second stage.
Step-wise regression In order to compare the traditional method to our method, we used a step-wise method based on Akaike information criterion (AIC) [6] to select a formula-based model, which was implemented under R, the "GNU S". The details of this method can be found in the R manual [7].

Results
Only the first of the 100 simulated data sets was used.  Table 1 shows the top 30 markers, a marker from chromosome 7 is rated as most significant, and there are seven, four, and two markers from chromosomes 7, 15, and 21, respectively. The variable gender was ranked as 15 th . Table 2 shows the most significant 20 markers obtained from the univariate LS regression and from the step-wise regression. These markers were very much evenly distributed in all chromosomes.
When we considered the results from the univariate regression, a marker from chromosome 13 was most significant. One each from chromosome 7 and chromosome 15 were only marginally significant; none from chromosome 21 (where a linked marker was located) are in the top 20 most significant markers. Similar results were obtained when backward and forward step-wise regression methods were used. Among the top 20, only two markers were from chromosome 15, and one each from chromosomes 7 and 21. Also, these two traditional methods failed to locate the gender effect as significant. The result from the second stage SSVS is shown in Figure  3. The marginal posterior probabilities of the interaction effects are displayed. The existence of the gender-gene interaction on chromosome 21 is clear.
The same analysis was also carried out on the third simulated data set and similar results were obtained.

Conclusion
We showed that SSVS is a powerful method in identifying linked markers using the Haseman-Elston method, even for weak effects. SSVS is very effective because it does a smart search over the entire model space, while the frequentist best subset model selection procedures are constrained by computing power required to examine all candidate models. The former can work on problems with many more candidate variables, which is essential when interaction effects are studied. By using the prior structures that reflect the relation among the candidate variables, SSVS can accommodate a good number of candidate markers as well as their interactions. The two-stage strategy used in this study worked well. It identified the chromosomes of the linked markers in the first stage and the interaction effects were located in the second stage. Both univariate regression and the step-wise regression failed to identify the chromosomes of the linked markers.

Discussion
One thing found to be interesting was that when we also use the squared-difference as response for comparison, its false positives did not overlap with those of cross-product.
As we can see in Figure 4, the red strip covers those unlinked markers with high posterior probability when squared-difference is used, the posterior of these markers when cross-product are used are just average. So the information obtained from these two responses is complementary. This result is not a surprise because many recent works [8] have proposed the use of both the squared-sum and squared-difference as responses and combined the results of two regressions together in drawing the inference. As a special case, the cross-product response weights these two equally. For a comprehensive review on these "new" Haseman-Elston methods, see Feingold [8]. A natural extension of the methods proposed in this paper is to combine the posterior probability from regressions on both squared-sum and squared-difference. We will investigate this in our future research. Gene × gene and gene × gender interaction effects