A Likelihood Ratio Test for Genome‐Wide Association under Genetic Heterogeneity

Most existing association tests for genome‐wide association studies (GWASs) fail to account for genetic heterogeneity. Zhou and Pan proposed a binomial‐mixture‐model‐based association test to account for the possible genetic heterogeneity in case‐control studies. The idea is elegant, however, the proposed test requires an expectation‐maximization (EM)‐type iterative algorithm to identify the penalised maximum likelihood estimates and a permutation method to assess p‐values. The intensive computational burden induced by the EM‐algorithm and the permutation becomes prohibitive for direct applications to GWASs. This paper develops a likelihood ratio test (LRT) for GWASs under genetic heterogeneity based on a more general alternative mixture model. In particular, a closed‐form formula for the LRT statistic is derived to avoid the EM‐type iterative numerical evaluation. Moreover, an explicit asymptotic null distribution is also obtained, which avoids using the permutation to obtain p‐values. Thus, the proposed LRT is easy to implement for GWASs. Furthermore, numerical studies demonstrate that the LRT has power advantages over the commonly used Armitage trend test and other existing association tests under genetic heterogeneity. A breast cancer GWAS dataset is used to illustrate the newly proposed LRT.


Introduction
Common and complex diseases (or traits) are often genetically heterogeneous in aetiologies (Lander & Schork, 1994;Zhou & Pan, 2009). Some well-known complex diseases with genetic heterogeneity include asthma, breast cancer (Hall et al., 1990;Wooster et al., 1994;Turnbull et al., 2010), and diabetes (Hattersley, 1998;Sladek et al. 2010). As in Zhou & Pan (2009), this paper considers the situation when a complex disease (or trait) is caused by mutations in multiple unlinked loci, commonly referred to as locus heterogeneity (Ott, 1999;Abreu et al., 2002;Fu et al., 2006). As a consequence of genetic heterogeneity, the population of individuals with disease may be decomposed into various latent sub-populations, each with disease caused by mutations at different loci (or their combinations). Most of the existing association tests for population-based case-control studies, e.g. GWAS, are based on comparing the mean genotype scores (e.g. the Armitage trend test (ATT)) between the case and control groups, which are not efficient in the presence of genetic heterogeneity. Zhou & Pan (2009) showed that it can be beneficial to use methods that account for genetic heterogeneity for testing association in a case-control study.
Similar to admixture mapping in linkage analysis (Smith, 1963;Abreu et al., 2002;Fu et al., 2006), Zhou & Pan (2009) proposed a binomial mixture model to account for genetic heterogeneity and developed a modified likelihood ratio test (MLRT) for a single locus (Fu et al., 2006). They also consider two methods to combine single-locus-based MLRTs across multiple loci in linkage disequilibrium to boost power when causal single nucleotide polymorphisms (SNPs) are not genotyped (Zhou & Pan, 2009). They illustrated, with a wide spectrum of numerical examples, that the proposed MLRT tests are more powerful than some commonly used association tests under genetic heterogeneity. Following Zhou and Pan, we define the genetic score X as the number of the minor alleles at a single locus for a subject. Zhou and Pan (2009) assumed that the genetic score X H in a healthy control subject follows a binomial distribution, that is where B 2 (g , θ b ) = ( 2 g )θ g b (1 − θ b ) 2−g and θ b represents the minor allele frequency (MAF) on that specific locus of the control subject. On the other hand, under genetic heterogeneity, the genetic score for a diseased subject, X D , follows a simple two-component mixture binomial distribution, P (X D = g ) = α 1 B 2 (g , θ b ) + α 2 B 2 (g , θ), g = 0, 1, 2; 0 ≤ α 1 , α 2 ≤ 1, α 1 + α 2 = 1, where θ represents the probability of having the minor allele on one chromosome for a sub-group of cases with disease associated with the minor allele. They adopt a two-step procedure for parameter estimation. First, a maximum likelihood estimate (MLE) of θ b is obtained based only on the control sample. Then, fixing the estimated θ b at its MLE derived from the control-group data, maximum penalised likelihood estimates of other parameters in the mixture model are obtained using an expectation-maximization (EM)-type algorithm (Li et al., 2009). Subsequently, the penalised MLEs from the EMstep are plugged into a likelihood ratio to form a test statistic to detect the association between the marker genotypes and the disease status. Finally, they proposed a permutation procedure to obtain the p-value of the association test. Zhou and Pan's idea is applicable to an association study for a limited number of candidate markers; however, there are several challenges in applying their proposed method to genome-wide association studies (GWASs). First, the computation of their proposed MLRT for a vast number of SNPs in a typical GWAS would be very intensive. Since the penalised MLEs are obtained by an EM algorithm for maximization of the penalised mixture likelihood, there are known complexities and caveats associated with the EM or other iterative methods for identifying MLEs and penalised MLEs in mixture models including the challenges in selecting multiple starting points for parameter estimation. Moreover, the p-value of the MLRT is proposed to be attained by permutation, which is also difficult to apply directly to detect the SNP-disease association in GWAS with a vast number of SNPs, where the significance level is usually set to be less than 10 −6 . In addition, it is widely believed that complex diseases and traits are caused by interplays of a large number of genetic loci and environmental risk factors. The simple binomial mixture model with two components in Equation (2) may be too simple to capture the complex heterogeneity for many complex diseases. A more general form of binomial mixture model can be written as follows: where η = (η j ) j ≤J , η j = (θ j , α j ) T , j = 1, . . . , J , and θ i = θ j if and only if i = j . In particular, for many of the complex diseases with genetic heterogeneity, it is likely that J is quite large. Since it is hard to know the number of the subpopulations J under genetic heterogeneity, it is desirable to have a new test that is applicable without the need to know the exact value of J while allowing J ≥ 2.
In this paper, we developed a LRT for GWASs based on the more flexible binomial mixture models in (3). It is widely believed that complex diseases and traits are caused by interplays of a large number of genetic loci and environmental risk factors. Thus, we assume that the genetic score in the case group, X D , follows a general binomial mixture distribution in (3) which allows the possibility of a large and unknown J . The proposed LRT overcomes the above-mentioned challenges of using Zhou and Pan's method for testing association of a vast number of SNPs in a typical GWAS. In particular, we derived the closed-form formula for the LRT statistic even though the MLEs of parameters in the binomial mixture model are non-regular with loss of identifiability (Liu & Shao, 2003). We further derived the simple closed-form asymptotic null distribution of the LRT which avoids the intensive numerical calculations, such as the EM-based iterations for identification of MLEs and the permutations for evaluation of p-values. Additionally, the LRT can be implemented without the requirement of knowing the number of components J in the mixture model (3). We conducted extensive simulation studies to show that the LRT has power advantages over Armitage trend test (ATT) and some other association tests under genetic heterogeneity. We applied our test to a real dataset from a breast cancer GWAS to illustrate that it can achieve a much smaller p-value than some commonly used tests when there is evidence of genetic heterogeneity. Thus, the proposed LRT might be used to scan SNPs in GWAS to make novel discoveries by taking account of genetic heterogeneity.

Notation and Setup
We focus on detecting marker-disease association at a single locus with two alleles A and a , such as a SNP in a case-control GWAS. Suppose m + controls and n + cases are sampled from the population. For each SNP, the genotype frequencies in a case-control study can be summarised as in the following 2 × 3 table (Table 1).
Let the genetic score X H and X D denote the number of minor alleles, say a , at a single locus for a healthy control and a diseased case, respectively. It is clear that X H = 2m 2 + m 1 , X D = 2n 2 + n 1 . Similar to Zhou and Pan's setup, we assume that under the null hypothesis, both X H and X D have the same binomial distribution B 2 (g , θ b ) as described in Equation 1. As in Zhou & Pan (2009), X H is assumed to have a binomial distribution under H 1 . Under the alternative hypothesis of genetic heterogeneity, we assume that X D has a mixture distribution as described in Equation 3. This last assumption is worthy of further comments. On the one hand, it is possible to have J > 2 in Equation 3 under H 1 both in practice and in theory; thus, it is conceptually desirable to allow J > 2 in Equation 3. On the other hand, for likelihood inference, it is not necessary to have J > 2 in the model in order to achieve the maximum of the likelihood because the model is actually saturated with J = 2. In other words, for a given dataset, posing a model (3) with J = 2 or with J ≥ 2, the testing results from the LRT are not going to be different. In fact, as will be seen in the next section, our proposed LRT actually has the "non-parametric" nature because it has a closed-form formula, with a simple null distribution shown to be valid,; thus, it will be valid for testing any alternative models including the common models and those under heterogeneity. In this paper, we will establish that the test is actually a LRT under the specified setup motivated by the elegant work of Zhou & Pan (2009) and by the fact that the LRT has wellknown optimalities in terms of statistical power and efficiency.

Mixture Binomial and Maximum Likelihood
Assuming the setup in the previous sub-section, under H 0 , using the notation in Table 1 and denoting the true value of θ b as P 0 , the MLE of P 0 for the overall combined case-control data in Table 1 iŝ Thus, the binomial likelihood function for the overall combined case-control data evaluated atp 0 , L 0 , is given by where B 2 (g ,p 0 ) = ( 2 g )p g 0 (1 −p 0 ) 2−g andp 0 is defined in (4). Following Zhou & Pan (2009), in the control group, the genetic score X H is assumed to follow a binomial distribution under the alternative hypothesis, say Table 1 The genotype frequencies for case-control data of a SNP.

AA a A a a Total
Case Using the notation in Table 1, the MLE of P H within the healthy control group only is given bŷ The binomial likelihood function of the healthy controls data evaluated atp H , L H , is given by Similarly, in the case group, if the genetic score X D has the distribution B 2 (g ; P D ), the MLEp D of P D within the diseased case group only would bê However, as in Zhou & Pan (2009), we assume that under genetic heterogeneity, the cases can be divided into multiple latent sub-populations. Thus, under the alternative hypothesis of genetic heterogeneity, we assume that X D has a mixture distribution as described in Equation 3. It can be shown that (see Appendix I), using the above notation, the maximum of the mixture likelihood for X D has an explicit formula: The derivation of the above equation can be found in Appendix I. It is also clear from the derivation in Appendix I that the mixture likelihood function of the parameter vector η = (θ j , α j ) j ≤J in the mixture model (3) can have many local maxima due to the lack of identifiability in parameters (Liu & Shao, 2003). Nevertheless, the supremum of the mixture likelihood L D for X D has a single unique value for each dataset and can be obtained from the explicit formula in Equation (10). In the typical case-control study design, L D is independent of L H .

The LRT
Using the maximum of the likelihood L 0 , L H and L D in Equations 5, (8) and (10), respectively, we can write down the explicit formula of the log-LRT statistic as follows: No iterative numerical maximization of the mixture likelihood function is needed for the evaluation of the LRT statistic in (11). Thus, the LRT statistic is easy to compute even for GWAS. It is known that the LRT statistics for testing homogeneity in mixture models often have complicated asymptotic distributions that typically lack closed-form representations. However, the above statistic 2λ N can be shown to have an explicit form of asymptotic distribution under the null hypothesis. More specifically, under H 0 , as n + → ∞ and m + → ∞, we have where χ 2 d denotes a chi-squared distribution with d degrees of freedom, d = 1, 2. Although the above asymptotic null distribution can be derived from general results such as those in Chernoff & Lander (1995), Chiano & Yates (1995) or Liu & Shao (2003), an elementary and detailed direct derivation of the above asymptotic null distribution is given in Appendix II for readers who are interested in a direct derivation based on first principles.
It is worth pointing out that our extensive numerical simulations discussed in the next section indicate that the simple asymptotic null distribution in (12) approximates the exact finite sample null distribution very well. The asymptotic formula is only slightly conservative. Therefore, the p-values of the LRT can be easily read off from the above simple closedform asymptotic null distribution. For example, given any observed data in Table 1, one can first evaluate the value of 2λ N in (11) and then can obtain the p-value using the following simple command in the widely used R-platform: Last but not least, it is well known that the LRT generally has better power than other ad hoc tests. Thus, it should not be a surprise to see that the LRT can be more powerful than other commonly used tests which ignore the genetic heterogeneity that exists for many common complex diseases such as breast cancer. Finally, to implement the LRT, there is no need to identify the exact number of mixture components J in (3), which is desirable because J is hard to determine in practice.

Type I Errors
The LRT has an explicit asymptotic distribution under H 0 . Consequently, it is convenient to evaluate the p-value and type I errors. We conducted comprehensive simulations to compare the empirical type I error of the LRT to the nominal significance level ranging from 10 −2 to 10 −8 . In the Monte Carlo simulations, the genotype data for both the control group and the case group were generated from the same binomial distribution B 2 (g ; θ b ), where θ b takes some fixed value P 0 , which represents the MAF. A number of simulation setups, which varied over a range of MAF and sample size, were selected. The control and case sample sizes are set to be equal. The nominal significance levels were taken to be 10 −2 , 10 −3 , 10 −4 , 10 −5 , 10 −6 , 10 −7 and 10 −8 , respectively. For each setup, 10 11 samples were generated. We found that the empirical type I error is slightly smaller than the nominal level, but they are extremely close to each other. Thus, using the asymptotic null distribution for the LRT is valid. For illustration, an example with θ b = 0.4 and sample size n + = m + = 1000 is shown in Table 2.

Power Comparison
The significance level of the association test is usually set very small for GWASs. For example, the genome-wide significance level of 5×10 −8 is being increasingly used for arrays that contain one million SNPs. The most commonly used association tests for GWAS include ATT and the χ 2 2 test, both applicable for testing association in a 2 × 3 table between the case-control status and the three genotypes, as illustrated in Table 1. Accordingly, we designed simulation studies to evaluate and compare the powers of the LRT, the ATT and the χ 2 2 test when the significance level is set to be 5×10 −8 . Note that, the MLRT of Zhou & Pan (2009) was not included for comparison due to its severe computational challenge when the significance level is very small. In the first set of Monte Carlo simulations, the control sample was generated from a binomial distribution B 2 (g , θ b ); the case sample was generated from a two-component mixture binomial distribution as Table 2 Empirical type I error and nominal significance level at θ b = 0.4 and n + = m + = 1000.
Twenty thousand replicate datasets of n + = m + = N controls and cases were simulated for each of the eight simulations setup and the empirical power for each test is shown in Table 3. The simulation results indicate that the LRT has power advantage over the ATT and the χ 2 2 test under genetic heterogeneity.
Similar power advantages of the LRT over other tests are also observed when the alternative mixture model has three components (J = 3), as demonstrated in Table 4, where θ 3 for the cases is set as equal to θ b for the control group.
Note that the ATT, also called Cochran-Armitage trend test (CATT) by many researchers, has good power only when the disease risk of the genotypes AA, Aa and a a is monotone increasing or decreasing under the alternative hypothesis (Armitage, 1955;Freidlin et al., 2002). Thus, ATT can have very low power when there is a violation of a linear trend in the disease risk across the ordered genotypes AA, Aa and a a , as in the case of both setups #3 and #5 in Table 3.
It is clear from the power simulations across the multiple simulation setups that the LRT can be much more powerful than the commonly used ATT and the χ 2 2 test in GWAS in the presence of genetic heterogeneity.

A Breast Cancer GWAS
Breast cancer is the most common cancer among women. Many genes on different chromosomes that underlie breast cancer have been identified including many well-known studies conducted two decades ago (Hall et al., 1990;Wooster et al., 1994). Many more genetic variants underlying breast cancer are still being discovered nowadays; thus, there is little doubt about the existence of genetic heterogeneity in the case of breast cancer. For illustration, we applied the newly proposed LRT to a breast cancer GWAS dataset. In particular, Turnbull et al. (2010) conducted a GWAS to identify breast cancer susceptibility alleles. They studied 582,886 SNPs in 3659 breast cancer cases and 4897 controls in the first stage, and evaluated promising SNPs that were identified in stage 1 in a second stage with 12,576 cases and 12,223 controls. In the paper, they reported five new susceptibility SNPs with summary genotype data of the five SNPs made publicly available. A literature search indicates that four of the five SNPs (rs1011970, rs10995190, rs704010 and rs614367) have been independently confirmed by other studies since the publication of their GWAS results in 2010 (Peng et al., 2011;Lambrechts et al., 2012). We evaluated the p-values of the LRT, ATT and χ 2 2 test for these four SNPs for comparison. The results are summarised in Table 5.

Discussion
In the analysis of GWAS data, potential latent genetic heterogeneity has largely been ignored by researchers. Zhou & Pan (2009) first proposed mixture models to account for genetic heterogeneity. However, for the analysis of a vast number of SNPs in GWAS, the MLRT of Zhou and Pan has major computational challenges. In this paper, using a more general binomial mixture model, we have derived a LRT for casecontrol association studies that improve the MLRT by Zhou and Pan on computational efficiency and multiple other aspects. In particular, the LRT statistic has a simple closed-form formula, which could avoid intensive computation, such as the EM algorithm for penalised MLEs. Additionally, we have derived an explicit asymptotic null distribution for the proposed LRT, which is convenient to obtain p-values even at a small significance level. Moreover, to perform the LRT, there is no need to decide the exact number of mixture components, which is convenient in practice. Therefore, the new LRT has computational advantages over the MLRT proposed by Zhou and Pan and is suitable for scanning SNPs in GWAS data.
As demonstrated by our numerical studies, in the presence of genetic heterogeneity, the LRT can be much more powerful than either Armitage's trend test or the χ 2 2 test, both of which are among the most widely used tests in GWAS. Given that most complex diseases are widely believed to be polygenic and have environmental components, genetic heterogeneity is a hallmark of complex diseases. As illustrated using the GWAS data for breast cancer, newly proposed LRT can be easily used for any GWAS data; thus, researchers can use the simple algorithm to scan their SNPs as a cost-effective way to potentially make novel and important discoveries using existing data already collected in the large number of GWAS. Given that there are already about 1000 published GWAS, and many more genome-wide studies are being planned and conducted, the new LRT has the potential to become one of the useful tests to scan the SNPs in these GWAS, maybe as a secondary analysis to account for genetic heterogeneity. Thus, the new user-friendly LRT can potentially be used to increase the impact of existing and future GWASs.
As indicated above,θ 1 can take any values in an interval; thus, there are infinitely many sets of solutions for the MLE. Thus, Equation 13 is proved. Next, we show that when 4n 0 n 2 ≤ n 2 1 , First, we show that for any fixed η, Using the inequality log x ≤ x − 1, we get It is straightforward to verify that 2 g =0 n g P η (X D = g ) B 2 (g ,p D ) − 1 = n + 4n 0 n 2 − n 2 1 (2n 0 + n 1 ) 2 (2n 2 + n 1 ) 2 Therefore, when 4n 0 n 2 ≤ n 2 1 , and for any η Finally, it is obvious that This finishes the proof (15), thus also (10).