Send Orders of Reprints at Reprints@benthamscience.net Gene-based Genomewide Association Analysis: a Comparison Study

The study of gene-based genetic associations has gained conceptual popularity recently. Biologic insight into the etiology of a complex disease can be gained by focusing on genes as testing units. Several gene-based methods (e.g., minimum p-value (or maximum test statistic) or entropy-based method) have been developed and have more power than a single nucleotide polymorphism (SNP)-based analysis. The objective of this study is to compare the performance of the entropy-based method with the minimum p-value and single SNP–based analysis and to explore their strengths and weaknesses. Simulation studies show that: 1) all three methods can reasonably control the false-positive rate; 2) the minimum p-value method outperforms the entropy-based and the single SNP–based method when only one disease-related SNP occurs within the gene; 3) the entropy-based method outperforms the other methods when there are more than two disease-related SNPs in the gene; and 4) the entropy-based method is computationally more efficient than the minimum p-value method. Application to a real data set shows that more significant genes were identified by the entropy-based method than by the other two methods.


INTRODUCTION
Single nucleotide polymorphism (SNP)-based genomewide association studies (GWAS) have been a popular and successful method to identify disease-related SNPs. However, this approach has much lower power when the number of SNPs increases and SNPs are correlated, especially when their effect sizes are small and only their cumulative effect is associated with a disease. Gene-or region-based analysis may have higher power to identify the causal variants that affect the complex disease, because it takes into consideration the correlations among SNPs within a single gene.
The simplest method for gene-based analysis is the SNPbased method, in which each genotyped SNP is tested for association, and multiple testing corrections based on the Bonferroni procedure are applied to control the type-I error rate. The most widely used single SNP-based association test method is Cochran-Armitage trend test (CATT) which has high power under additive and multiplicative disease models but much low power under recessive disease model [1][2][3][4]. The genotypic test based on a 2 3 contingency table is robust to different disease models [5]. Some other innovative methods include entropy-based method which is generally as good as or even more powerful than the genotypic test [5,6]. The SNP-based method for gene-based analysis has low power when the causal variants are highly correlated with one or more genotyped SNPs and when the causal SNPs are not genotyped. The power of the SNP-based method can be improved by combining the information from neighboring SNPs within a single gene. Several methods have been developed to analyze multiple SNPs within the same gene simultaneously. These methods include Fisher's method for combining p-values by a logarithm function of p-values and the minP (minimum p-value) or maxT (maximum test statistics) method in which the significance level can be determined by the observed p-value. However, the empirical pvalue must be calculated by using permutation, because the limiting distributions of Fisher's statistic and minP (maxT) statistic are unknown under the null hypothesis that the gene is not associated with the disease.
Another alternative method to combine multiple SNPs is to do multivariate tests. Chapman and Whittaker proposed a multivariate score test statistic that is equivalent to the score test for the logistic regression model [7]. Another test statistic based on an empirical Bayesian model for the parameters was similar to the above multivariate score test statistic [8]. Wang and Elston proposed a test statistic using a weighted Fourier transform of the genotypes to reduce the test degrees of freedom [9]. Chapman and Whittaker compared the above five methods by simulation studies, and they found that the minP (maxT) and Goeman's method perform well over a range of scenarios [7].
For the minP (maxT) method, a Monte Carlo (MC) method can be used to evaluate the empirical p-values based on approximating the joint distribution of the test statistics by an MC-sampling approach. This is computationally feasible compared with a permutation method [10]. An entropybased test statistic was recently proposed to test gene-disease association based on the joint genotypes on multiple SNPs within a gene and a cluster-based analysis method was used to reduce the degrees of freedom of the test statistic [11].
In this study, we compare three methods, namely the single SNP-based method, the maxT method with MC sampling to estimate the empirical p-value, and an entropy-based method, by simulation studies and real data analysis. We start with a detailed description of each method, followed by simulations and real data analysis.

MaxT (or minP) Method with Monte Carlo Sampling
Much of what follows in the section below is adapted from Lin [10]. Consider one gene with m genic SNPs, each with two alleles. , and n is the sample size. The test statistic for the j-th locus within this gene is defined as . This test statistic follows an 2 distribution with j r degrees of freedom, The test statistics (T 1 , T 2 ,…, T m ) may be correlated due to linkage disequilibrium among SNPs within one gene. The pvalues evaluated by using the actual joint distribution of (T 1 , T 2 ,…,T m ) can be computationally intensive. Lin [10] proposed an MC method to approximate the actual joint distribution to evaluate the empirical p-values by MC sampling.
The MC method defines , and G 1 , G 2 ,…,G n are independent, standard, normal, random variables that are independent of the data.
where is the preset significance level, then the null hypothesis that this gene is not associated with the disease is rejected.

Entropy-based Test Statistic and Genotype Grouping via Penalized Entropy
For one gene with m genic SNPs, there is a total of 3 m joint genotypes. However, the real number of joint genotypes is much less. Denote the number of observed joint genotypes for one gene by s (s<3 m ). Let the frequencies of the i-th joint genotype in cases and controls, respectively. Then the entropy-based test statistic for testing the association between this gene and a disease is as follows [11]: where )] log( , is the number of cases and controls, and Under the null hypothesis that there is no association between this gene and a disease, T gene follows a central 2 distribution with m-1 degrees of freedom. When the number of genic SNPs is high, the degree of freedom increases so that the power will decrease. To increase the power, the rare joint genotypes could be grouped into common ones by using the penalized entropy measure (PEM) [11]: where m k is the number of k-th joint genotypes. The joint genotype set with maximum value of I will be the corresponding common joint genotype. To do so, we first sort all joint genotypes in descending order, according to their frequencies. Then we calculate the PEM by adding one joint genotype to the present joint genotype set. If the PEM begins to decrease when the k-th joint genotype is added to the current set, the common joint genotype set will include the former k-1 joint genotypes.
Once the grouping threshold is determined, we can proceed to calculate the similarities between one rare-joint genotype with frequency less than the threshold and all common genotypes and then group it with the common one that is the most similar.

SIMULATION STUDIES
We evaluated the performance of the three methods described above by using simulation studies. We simulated case-control samples in two methods: one using a linkagedisequilibrium (LD)-based method similar to methods in [10,11], and the other using an MS program developed by Hudson [12] that is similar to programs developed by Tzeng [13]. Although we will not discuss the LD-based simulation method here (see [11]), we describe below the detailed process to generate samples by the MS program.

MS Program
We used the MS program developed by Hudson [12] 10 4 , where the parameter g is the probability of crossover per generation between the ends of the haplotype locus being simulated; the scaled mutation rate for the simulated haplotype region, bp n e / 4 μ , is set to be 4 10 6 . 5 for the region of simulated haplotypes; and the length of sequence within the region of simulated haplotypes, n sites, is 10 kb. Similar parameter settings can be found in other studies [10,12,13]. We set the number of SNP sequences in the simulated sample to 100 for each gene and run the MS program to generate the haplotype sample on the basis of these parameter settings. Then we randomly selected a segment of 10 adjacent SNPs as a haplotype. The two haplotypes are randomly drawn from the simulated sample containing 100 10-SNP haplotypes and are paired to form an individual genotype.

Phenotype Simulation
In reality, we do not know the true functional mechanism for a given gene, so it is difficult to simulate the true functional variants and the true functional mechanism within a gene [13]. Here, we considered three scenarios to mimic the situation of a complex disease in which there is one, two, or three disease-related SNPs within a given gene. For cases with two or three disease-related SNPs, complex interactions occur among the SNPs. Here we briefly illustrate how the disease phenotypes are simulated.
Scenario 1. Let f 0 , f 1 , f 2 be three penetrances of three genotypes. Denote 1 = f 1 /f 0 , 2 = f 2 /f 1 as the genotype-relative risks (GRRs). Let p be the disease allele frequency, and denote the disease prevalence as k. Then the three penetrances can be calculated for an additive, dominant, or recessive disease model ( Table 1). We omit a multiplicative model, because the results of that model are similar to those from the additive model. Once f is determined, the case/control status is simulated according to a Bernoulli distribution, with the probability of success f conditional on the observed genotype data.  For a disease model with two or three interactions of disease-related SNPs within a single gene (Scenarios 2 and 3), we follow the cases given in [14].

Scenario 2.
For the two-locus-interaction disease model, we denote the two-locus genotypes as (G A , G B ) (0, 1, 2) 2 , which represents the number of risk alleles at each diseaserelated SNP A and B. The two-locus-interaction disease model is as follows: where is the baseline effect, and is the genotypic effect.

Scenario 3.
For the three-locus-interaction disease model, we denote the three-locus genotypes as (G A , G B , G C ) (0, 1, 2) 3 , which represents the number of risk alleles at each disease-related SNP A, B, and C. The three-locus-interaction disease model is as follows: where and are the same as in Scenario 2. Once the disease-related SNPs are determined, the case-control status can then be simulated according to a multinomial distribution conditional on the observed genotype data.
We simulated data sets with 400 cases and 400 controls or 800 cases and 800 controls. For the evaluation of type one error rate, we simulated data sets using both LD-based and MS methods but for power, we only used MS method because it can better mimic the biological data. For each data set, we applied the three methods described above. The type-I error rate was estimated based on 1000 replicates, and the power was estimated based on 100 replicates at a significance level of 0.05. For the maxT method, the empirical pvalue was obtained based on 10,000 normal samples.

REAL DATA ANALYSIS
To compare the three methods, we applied them to a large-scale, candidate-gene study. The data set contains 225 cases and 585 controls on 190 candidate genes in a genetic association study of preeclampsia [15]. We removed SNPs with minor allele frequencies less than 0.05 and focused on the remaining 819 SNPs. We also removed 27 genes carrying only one SNP. Similar to [11], we used a nominal level of 0.005 for the gene-based method and 0.005 dividing the number of SNPs within each gene for SNP-based method.
( Table 2) lists the p-values of significant genes and SNPs for the three methods. The genes and SNPs that showed significant effects are formatted in bold. The entropy-based method identified seven significant genes among the 190 genes evaluated. The single SNP-based method identified three significant genes, and the maxT method identified one significant gene. Thus, the gene-based entropy method identified the most number of significant genes.

SIMULATION RESULTS
( Table 3) presents the empirical type-I error rates of the single-SNP, maxT, and entropy-based methods based on the MS program and LD-based method. From (Table 3), we see that the maxT and entropy-based methods control the type-I error rate quite well. The latter also controls as the sample size increases. However, the single-SNP method has a much lower type-I error rate, which means that this method may have lower power. We also simulated 10 SNPs with r 2 =0.9, 0.5, and 0 within one gene by using the LD-based method and found that all three methods control the type-I error rate well.
( Table 4) presents the estimated power of the SNP-based, maxT, and entropy-based methods for one disease-related SNP within a single gene. The maxT method appeared to be the most powerful among the three methods. The entropybased method had lower power than the maxT method, because when one disease-related SNP occurs within a gene, the cluster number in the entropy-based method will be large, so that the degree of freedom of the test statistic in equation (1) is high. This will affect the power of the entropy-based method.
(Tables 5 and 6) present the estimated power of the three methods for situations in which two or three diseaserelated SNPs occur within a single gene. The entropy-based method appeared to be the most powerful method, and the single SNP-based method was the least powerful. This makes sense because when there are two or three interacting-disease-related SNPs within one gene, the cluster number of the observed joint genotypes will be small. Thus, the degrees of freedom of the test statistic in equation (1) will be small, which will improve the power of the entropybased method.

DISCUSSION
We have compared three gene-based association approaches by conducting simulation studies and one real data set analysis. Simulation results show that 1) all three methods effectively control the type-I error rate; 2) the single SNP-based method is very conservative; 3) when there is one disease-related SNP within a gene, the maxT method is the most powerful; 4) when there are two or three diseaserelated SNPs within a gene, the entropy-based method is the most powerful. Real data analysis shows that the entropybased method identifies more significant genes than do the other two methods. In addition, we have compared the computing time used by the three methods and found that the entropy-based method is computationally more efficient than the maxT method.
Given the unknown number of causal SNPs as well as the complex structure among/between causal and non-causal SNPs within the gene, and the complex underlying disease gene actions, the relative performance of different approaches for gene-based association tests strongly depends on different realistic scenarios. Considering genes as testing units, sometimes we have to move forward to pursue gene-based interactions to get better biological insights into the etiology of complex diseases [16]. As new approaches are increasingly developed, we believe that no single approach is universally superb to others [4]. We suggest that users explore as many different approaches as possible and choose the best one based on their biological experience.
Rare variants may play an important role to explain the missing heritability of complex disease in post-GWAS research. The correlations between rare and common SNPs and among rare variants are generally weak [17], and the number of causal rare SNPs each with moderate or large effect sizes may be large [18]. The novel statistical or computational methodologies for analyzing rare variants focusing on genes are urgently needed with the availability of large scale exome or wholegenome sequencing data [19]. The relative performance of these approaches for gene-based association tests is worthy of further investigation.     Data were obtained using the entropy-based method. d Data were obtained using the single-SNP-based method.